New Reward Model Outperforms Existing Ones in Evaluating AI Tool Use

IBM Research has introduced ToolRM, a reward modeling framework that substantially improves the evaluation of Large Language Models (LLMs) when they interact with external tools. The work addresses a critical gap in current AI evaluation, as LLMs are increasingly tasked with complex workflows that involve calling APIs, databases, and other software functions.

Existing reward models, often trained on natural language generation, struggle to accurately assess the effectiveness of tool-based reasoning and execution. To tackle this, the researchers first developed FC-RewardBench, the first benchmark specifically designed to evaluate reward models in these tool-calling scenarios. Their analysis using this benchmark revealed that standard reward models frequently miss key signals of successful tool use, underscoring the need for domain-specific solutions.
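To make the setup concrete, an evaluation item in the style of FC-RewardBench might look like the sketch below. The field names and structure are illustrative assumptions, not the benchmark's published schema.

```python
# Illustrative shape of one reward-model evaluation item in the style of
# FC-RewardBench: a user input, the available tool catalog, and a pair of
# correct/incorrect tool calls. Field names are assumptions, not the
# benchmark's published schema.
example = {
    "user_input": "Book a flight from Boston to Denver on May 3 for 2 people.",
    "tools": [
        {
            "name": "flight_search",
            "parameters": {
                "origin": "string",
                "destination": "string",
                "date": "string",
                "num_passengers": "integer",
            },
        }
    ],
    "correct_call": {
        "name": "flight_search",
        "arguments": {"origin": "BOS", "destination": "DEN",
                      "date": "2025-05-03", "num_passengers": 2},
    },
    "incorrect_call": {
        "name": "flight_search",
        "arguments": {"origin": "BOS", "destination": "DEN"},  # missing date and passengers
    },
}
# A reward model passes this item if it scores `correct_call` above `incorrect_call`.
```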

Guided by these findings, IBM Research developed ToolRM, a framework for training outcome-based reward models (ORMs) on data synthesized from open-weight LLMs. The resulting ToolRM models range in size from 1.7 billion to 14 billion parameters.
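As a rough illustration of how such an outcome reward model could be queried, the sketch below scores a (request, tool call) pair with a sequence-classification head. The checkpoint name and prompt format are assumptions, not the paper's published API; the released models may differ.

```python
# Minimal sketch of outcome-style reward scoring for a single tool call.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "ibm/toolrm-1.7b"  # hypothetical identifier, not a published checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=1)
model.eval()

def score_tool_call(user_request: str, tool_call_json: str) -> float:
    """Return a scalar reward for a candidate tool call given the request."""
    # Prompt format is an assumption; a real ToolRM would use its own template.
    text = f"User: {user_request}\nTool call: {tool_call_json}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits  # shape (1, 1): the outcome score
    return logits.item()
```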

Key findings and contributions from the paper include:

  • FC-RewardBench: A new benchmark of 1,500 examples, each pairing a user input and a tool catalog with correct and incorrect tool calls, designed to systematically assess reward models on tool-calling tasks. Scores on the benchmark correlate strongly with downstream task performance.
  • ToolRM Framework: A method for training outcome reward models specifically for tool-calling tasks, using data generated from permissively licensed, medium-sized, open-weight LLMs.
  • Superior Performance: ToolRM models consistently outperformed general-purpose baselines on FC-RewardBench while remaining computationally efficient, and delivered up to a 25% average improvement in downstream task performance when used for Best-of-n sampling (see the sketch after this list).
  • Data Efficiency: Filtering fine-tuning data with ToolRM preserved, and in some cases improved, downstream tool-use performance while training on substantially less data (a filtering sketch also follows the list).
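To make the Best-of-n setting concrete: sample n candidate tool calls, score each with the reward model, and keep the top scorer. In the sketch below, `generate` and `score` are placeholder callables (any sampler, and a scorer like the one sketched earlier), not APIs from the paper.

```python
# Best-of-n sampling with a reward model: draw n candidate tool calls,
# score each, and return the highest-scoring one.
def best_of_n(user_request, generate, score, n=8):
    """Sample n candidate tool calls and return the highest-reward one."""
    candidates = [generate(user_request) for _ in range(n)]
    return max(candidates, key=lambda call: score(user_request, call))
```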
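The data-efficiency result can be sketched the same way: reward-based filtering keeps only training examples whose tool call clears a threshold before fine-tuning. The threshold value and field names below are illustrative assumptions, not the paper's exact recipe.

```python
# Reward-based data filtering for fine-tuning: keep only training examples
# whose (request, tool_call) pair scores above a threshold.
def filter_training_data(examples, score, threshold=0.0):
    """Return the subset of examples whose tool call clears the reward threshold."""
    return [
        ex for ex in examples
        if score(ex["user_request"], ex["tool_call"]) > threshold
    ]
```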

Concrete examples illustrating the problem and solution:

Imagine an AI assistant tasked with planning a trip. It needs to use a “flight search” tool to find available flights and a “hotel booking” tool to secure accommodation. A standard reward model might only check if the AI generated a coherent description of the trip. However, it wouldn’t necessarily know if the AI correctly identified the parameters for the flight search (e.g., destination, dates, number of passengers) or if the hotel booking tool was called with the right hotel name and check-in/check-out dates.

ToolRM, on the other hand, is trained to understand the structure and requirements of these tool calls. For instance, if the AI incorrectly specifies the “number of passengers” as “two adults and one child” when the flight search tool only accepts a single integer for the total number of passengers, ToolRM can identify this as an error. Similarly, if the AI forgets to include the required “check-in date” for the hotel booking, ToolRM will penalize this mistake.
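These failure modes can be made concrete with explicit schema checks. ToolRM itself is a learned model rather than a rule list, but the hypothetical flight-search schema below shows the kinds of violations it learns to penalize.

```python
# Explicit checks against a hypothetical flight-search tool schema,
# illustrating the error classes a tool-calling reward model should catch.
flight_search_schema = {
    "required": ["destination", "departure_date", "num_passengers"],
    "types": {"destination": str, "departure_date": str, "num_passengers": int},
}

def check_call(call: dict, schema: dict) -> list[str]:
    """Return a list of schema violations in a candidate tool call."""
    errors = []
    for field in schema["required"]:
        if field not in call:
            errors.append(f"missing required parameter: {field}")
    for field, value in call.items():
        expected = schema["types"].get(field)
        if expected is not None and not isinstance(value, expected):
            errors.append(
                f"{field} should be {expected.__name__}, got {type(value).__name__}"
            )
    return errors

# "two adults and one child" fails the integer check:
print(check_call(
    {"destination": "Tokyo", "departure_date": "2025-07-01",
     "num_passengers": "two adults and one child"},
    flight_search_schema,
))
# ['num_passengers should be int, got str']
```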

The paper highlights that even the smaller ToolRM models, such as the 1.7-billion-parameter version, provide substantial gains, making them valuable in resource-constrained environments. It also notes that while large language models acting as judges can achieve high accuracy, they carry significant computational costs, making specialized reward models like ToolRM a more efficient and practical choice.

This work by IBM Research represents a significant step forward in enabling more reliable and capable AI systems that can effectively leverage external tools to perform complex tasks.