New AI Technique Boosts Conversational Skills of Tool-Using Language Models
Large Language Models (LLMs) equipped with tools, like those used to book flights or calculate distances, are becoming increasingly common. However, these Tool-Augmented LLMs (TA-LLMs) often struggle with incomplete user requests or situations where no suitable tool exists. A new research paper introduces “DiaTool-DPO,” a novel approach to enhance the conversational capabilities of TA-LLMs, allowing them to handle diverse real-world scenarios more effectively.
The core idea behind DiaTool-DPO is to train the TA-LLM to navigate conversations more intelligently. The researchers model TA-LLM interactions as a Markov Decision Process (MDP), a mathematical framework for sequential decision-making. They define five distinct “states” within a conversation (sketched in code after the list):
- Initial State: The starting point, where the user makes a request.
- Tool Selected without Complete Slots: The LLM has identified the correct tool but needs more information (e.g., knowing it needs a “flight booking” tool but missing the destination).
- Tool Selected with Complete Slots: The LLM has all necessary information to use the tool (e.g., destination, date, and time for the flight).
- Wait for Tool Response: The LLM is waiting for the tool to provide results.
- Complete: The LLM has the results and is ready to provide the user with a final answer.
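To make the five states concrete, here is a minimal Python sketch; the class, function, and flag names are illustrative assumptions, not the authors' implementation, and the transition rule is only a rough reading of the flow described above.

```python
from enum import Enum, auto

class DialogueState(Enum):
    """The five conversation states described above."""
    INITIAL = auto()                   # user has just made a request
    TOOL_SELECTED_INCOMPLETE = auto()  # tool chosen, but required slots are missing
    TOOL_SELECTED_COMPLETE = auto()    # tool chosen and all slots are filled
    WAIT_FOR_TOOL_RESPONSE = auto()    # tool call issued, awaiting results
    COMPLETE = auto()                  # results received, ready to answer the user

def next_state(state: DialogueState, tool_found: bool, slots_filled: bool,
               tool_responded: bool) -> DialogueState:
    """Illustrative transition rule for the conversation MDP (not from the paper)."""
    if state is DialogueState.INITIAL:
        if not tool_found:
            return DialogueState.COMPLETE  # no suitable tool: reject and answer directly
        return (DialogueState.TOOL_SELECTED_COMPLETE if slots_filled
                else DialogueState.TOOL_SELECTED_INCOMPLETE)
    if state is DialogueState.TOOL_SELECTED_INCOMPLETE and slots_filled:
        return DialogueState.TOOL_SELECTED_COMPLETE
    if state is DialogueState.TOOL_SELECTED_COMPLETE:
        return DialogueState.WAIT_FOR_TOOL_RESPONSE
    if state is DialogueState.WAIT_FOR_TOOL_RESPONSE and tool_responded:
        return DialogueState.COMPLETE
    return state
```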
The researchers also categorize user queries into three types based on the expected conversation flow (see the sketch after this list):
- Type 1: All necessary information is provided upfront (e.g., “Book a flight from New York to London on July 10th”).
- Type 2: Some information is missing, requiring the LLM to ask follow-up questions (e.g., “Book a flight”).
- Type 3: No suitable tool exists for the request (e.g., “Write a poem about my cat”).
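As a rough illustration (ours, not the paper's), the three query types can be thought of as a function of whether a suitable tool exists and whether any required slots are still missing; the helper below is hypothetical.

```python
from enum import Enum, auto

class QueryType(Enum):
    COMPLETE = auto()    # Type 1: all slots provided upfront
    INCOMPLETE = auto()  # Type 2: LLM must ask follow-up questions to fill slots
    NO_TOOL = auto()     # Type 3: no suitable tool exists; the call should be rejected

def classify_query(tool_found: bool, missing_slots: list[str]) -> QueryType:
    """Hypothetical helper: map a parsed request to its query type."""
    if not tool_found:
        return QueryType.NO_TOOL
    return QueryType.INCOMPLETE if missing_slots else QueryType.COMPLETE
```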
DiaTool-DPO uses a training process called Direct Preference Optimization (DPO). The model is presented with pairs of conversation “trajectories” – one “correct” path and one “incorrect” path. For example, if a user asks “What’s the weather?” and the LLM starts booking a flight (an “incorrect” trajectory), that’s penalized. Conversely, asking “What city are you interested in?” (leading to a “correct” trajectory) is rewarded.
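A preference pair might be represented roughly as follows; the data structure and field names are illustrative, but the example mirrors the weather scenario above.

```python
from dataclasses import dataclass

@dataclass
class TrajectoryPair:
    """One training example: a preferred and a dispreferred conversation path."""
    prompt: str
    chosen: list[str]    # the "correct" trajectory (rewarded)
    rejected: list[str]  # the "incorrect" trajectory (penalized)

example = TrajectoryPair(
    prompt="What's the weather?",
    chosen=["Assistant: What city are you interested in?"],  # asks for the missing slot
    rejected=["Assistant: calling book_flight(...)"],        # wrong tool, wrong behavior
)
```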
One key innovation is a specialized loss function that guides the LLM toward desirable conversational behaviors, such as deciding when to ask follow-up questions and when to reject a tool call because no suitable tool exists.
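The paper's exact loss is not reproduced here, but it builds on the standard DPO objective, which rewards the policy for assigning higher probability (relative to a frozen reference model) to the preferred trajectory than to the dispreferred one. A minimal PyTorch sketch of that standard objective, under the assumption that trajectory log-probabilities have already been summed per example:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective over summed log-probs of whole trajectories."""
    # Implicit rewards: how much more (or less) likely the policy makes each
    # trajectory compared with the frozen reference model.
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between the preferred and dispreferred trajectory.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```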
The team created a dataset, also called DiaTool-DPO, consisting of these paired trajectories, enabling TA-LLMs to learn which conversation flow is best in various scenarios. The study showed significant improvements, with DiaTool-DPO approaching GPT-4's performance in specific areas. For instance, performance in gathering necessary information jumped from a baseline of 44% to 94.8%, and correct rejection of tool calls improved from 9.6% to 91%.
This research marks a significant step towards building more robust and user-friendly TA-LLMs, capable of handling the nuances of real-world conversations without requiring extensive human labeling or expert demonstrations.