New AI Technique Boosts Conversational Skills of Tool-Using Language Models

Large Language Models (LLMs) equipped with tools, like those used to book flights or calculate distances, are becoming increasingly common. However, these Tool-Augmented LLMs (TA-LLMs) often struggle with incomplete user requests or situations where no suitable tool exists. A new research paper introduces “DiaTool-DPO,” a novel approach to enhance the conversational capabilities of TA-LLMs, allowing them to handle diverse real-world scenarios more effectively.

The core idea behind DiaTool-DPO is to train the TA-LLM to navigate conversations more intelligently. The researchers model TA-LLM interactions as a Markov Decision Process, a mathematical framework for sequential decision-making. They define five distinct “states” a conversation can be in (a minimal code sketch follows the list):

  1. Initial State: The starting point, where the user makes a request.
  2. Tool Selected without Complete Slots: The LLM has identified the correct tool but needs more information (e.g., knowing it needs a “flight booking” tool but missing the destination).
  3. Tool Selected with Complete Slots: The LLM has all necessary information to use the tool (e.g., destination, date, and time for the flight).
  4. Wait for Tool Response: The LLM is waiting for the tool to provide results.
  5. Complete: The LLM has the results and is ready to provide the user with a final answer.
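
To make this state machine concrete, here is a minimal Python sketch of the five states together with one plausible set of transitions between them. The names and the transition table are illustrative assumptions for this digest, not code from the paper.

```python
from enum import Enum, auto

class DialogueState(Enum):
    """Illustrative names for the five conversation states described above."""
    INITIAL = auto()                    # user has just made a request
    TOOL_SELECTED_INCOMPLETE = auto()   # right tool chosen, required slots still missing
    TOOL_SELECTED_COMPLETE = auto()     # all slots filled, ready to call the tool
    WAIT_FOR_TOOL_RESPONSE = auto()     # tool call issued, awaiting the result
    COMPLETE = auto()                   # result in hand, final answer can be given

# One plausible transition structure (an assumption for illustration,
# not a transition table taken from the paper).
ALLOWED_TRANSITIONS = {
    DialogueState.INITIAL: {
        DialogueState.TOOL_SELECTED_INCOMPLETE,  # slots missing: ask follow-up questions
        DialogueState.TOOL_SELECTED_COMPLETE,    # everything provided upfront
        DialogueState.COMPLETE,                  # no suitable tool: reject and answer directly
    },
    DialogueState.TOOL_SELECTED_INCOMPLETE: {DialogueState.TOOL_SELECTED_COMPLETE},
    DialogueState.TOOL_SELECTED_COMPLETE: {DialogueState.WAIT_FOR_TOOL_RESPONSE},
    DialogueState.WAIT_FOR_TOOL_RESPONSE: {DialogueState.COMPLETE},
    DialogueState.COMPLETE: set(),
}
```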

The researchers also categorize user queries into three types based on the expected conversation flow (a toy illustration of each type appears after the list):

  • Type 1: All necessary information is provided upfront (e.g., “Book a flight from New York to London on July 10th”).
  • Type 2: Some information is missing, requiring the LLM to ask follow-up questions (e.g., “Book a flight”).
  • Type 3: No suitable tool exists for the request (e.g., “Write a poem about my cat”).
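
The sketch below pairs each query type with the flow the trained model should follow. The field names and flow labels are hypothetical, chosen only to illustrate the typology, and are not the paper's schema.

```python
# Hypothetical examples mapping each query type to its expected conversation flow.
EXAMPLE_QUERIES = [
    {"type": 1, "query": "Book a flight from New York to London on July 10th",
     "expected_flow": ["call_tool"]},                        # all slots given upfront
    {"type": 2, "query": "Book a flight",
     "expected_flow": ["ask_follow_up", "call_tool"]},       # missing slots: clarify first
    {"type": 3, "query": "Write a poem about my cat",
     "expected_flow": ["reject_tool_call", "answer_directly"]},  # no suitable tool exists
]
```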

DiaTool-DPO builds on Direct Preference Optimization (DPO), a training method in which the model is shown pairs of conversation “trajectories”: one “correct” path and one “incorrect” path. For example, if a user asks “What’s the weather?” and the LLM starts booking a flight (an “incorrect” trajectory), that trajectory is penalized. Conversely, asking “What city are you interested in?” (the start of a “correct” trajectory) is rewarded.
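
What such a trajectory pair might look like as training data is sketched below. The message format, tool names, and field names are assumptions meant to mirror the example above, not the paper's released data format.

```python
# Hypothetical shape of one preference pair in the paired-trajectory data.
preference_pair = {
    "query": "What's the weather?",
    "chosen": [   # desirable trajectory: ask for the missing slot before calling a tool
        {"role": "assistant", "content": "What city are you interested in?"},
        {"role": "user", "content": "Seoul"},
        {"role": "assistant", "tool_call": {"name": "get_weather", "arguments": {"city": "Seoul"}}},
    ],
    "rejected": [  # undesirable trajectory: wrong tool, no clarifying question
        {"role": "assistant", "tool_call": {"name": "book_flight", "arguments": {}}},
    ],
}
```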

One key innovation is a specialized loss function that steers the LLM toward desirable conversational behavior, such as deciding when to ask a follow-up question and when to reject a tool call because no suitable tool is available.
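
For context, the standard DPO objective that a dialogue-specific loss of this kind builds on is shown below. The paper's specialized loss modifies this base formula, so the expression should be read as background rather than the exact objective used in DiaTool-DPO.

```latex
% Standard DPO objective (the base that a specialized dialogue loss would extend).
% y_w: preferred ("correct") trajectory, y_l: dispreferred ("incorrect") trajectory,
% \pi_\theta: policy being trained, \pi_{\mathrm{ref}}: frozen reference model, \beta: temperature.
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[\log\sigma\!\left(
      \beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
      -\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
    \right)\right]
```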

The team built a dataset of these paired trajectories for DiaTool-DPO training, enabling TA-LLMs to learn which conversation flow is best in various scenarios. The study reports significant improvements, with DiaTool-DPO approaching GPT-4’s performance in specific areas: accuracy in gathering missing information from the user jumped from a baseline of 44% to 94.8%, and correct rejection of tool calls when no suitable tool exists improved from 9.6% to 91%.

This research marks a significant step towards building more robust and user-friendly TA-LLMs, capable of handling the nuances of real-world conversations without requiring extensive human labeling or expert demonstrations.