The Tool Decathlon: Pushing Language Agents to Perform Complex Real-World Tasks
A new benchmark called TOOLATHLON has been introduced to rigorously evaluate language agents on diverse, realistic, and long-horizon tasks spanning multiple applications. It aims to address the limitations of existing evaluations, which often focus on narrow domains or simplified scenarios and thus fail to capture the complexity of real-world agent performance.
TOOLATHLON comprises 32 software applications and 604 tools, spanning a wide array of domains from everyday platforms like Google Calendar and Notion to professional tools such as WooCommerce, Kubernetes, and BigQuery. Its tasks are designed to mimic genuine user requests and are often presented with fuzzy or ambiguous instructions, requiring agents to infer intent and devise their own plans.
A key innovation of TOOLATHLON is its emphasis on realistic environment setups. Instead of using simplified or artificial states, the benchmark initializes tasks with realistic data, such as actual Canvas courses populated with student data or financial spreadsheets mirroring real-world scenarios. This approach, detailed in the paper, aims to present agents with challenges similar to those they would encounter in practical applications.
To ensure reliable evaluation, each task in TOOLATHLON is equipped with a dedicated verification script that deterministically judges success by comparing the final environment state against a ground-truth state. The benchmark also supports safe, efficient parallel evaluation inside containers, significantly speeding up testing of large language models.
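To make the verification idea concrete, here is a minimal sketch of what a deterministic, state-based checker could look like. All names here (`check_task`, the example keys) are hypothetical illustrations, not TOOLATHLON's actual API; the benchmark's real scripts inspect live application state, but the comparison principle is the same.

```python
# Hedged sketch of a deterministic task checker: success is decided
# purely by comparing the environment's final state to a ground truth,
# never by judging the agent's transcript.

def check_task(final_state: dict, expected_state: dict) -> bool:
    """Return True only if every expected key/value pair is present in
    the final state; unrelated extra state is ignored so incidental
    changes do not cause false failures."""
    return all(final_state.get(k) == v for k, v in expected_state.items())

# Example: a calendar task counts as solved only if the requested
# event exists with the right attendees.
expected = {"event_title": "Team sync", "attendees": ["alice", "bob"]}
final = {"event_title": "Team sync", "attendees": ["alice", "bob"], "color": "blue"}
print(check_task(final, expected))  # True
```

Because the check reads only the end state, the same script gives the same verdict no matter which tool-call sequence the agent used to get there, which is what makes parallel, repeatable evaluation possible.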
The researchers evaluated several state-of-the-art language models on TOOLATHLON. The results revealed a significant gap between current model capabilities and the demands of these complex tasks. The best-performing proprietary model, Claude-4.5-Sonnet, achieved only a 38.6% success rate. Among open-source models, DeepSeek-V3.2-Exp reached a 20.1% success rate. These figures highlight the substantial challenges that remain in developing robust and capable language agents for real-world, long-horizon task execution.
The paper details the benchmark’s design, including its extensive list of MCP (Model Context Protocol) servers, locally containerized environments, and a robust evaluation framework. It also delves into the specific challenges encountered by models, such as handling long contexts, dealing with tool call errors, and the impact of overlong tool outputs.
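For readers unfamiliar with MCP: it is a JSON-RPC 2.0 protocol in which a client invokes a server-side tool via a `tools/call` request. The sketch below shows the shape of such a request; the tool name and arguments are hypothetical examples, not tools from TOOLATHLON's server list.

```python
import json

def make_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 request invoking one MCP tool.
    MCP's tools/call method takes the tool name and its arguments
    as params; the server replies with the tool's output."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    })

# Hypothetical example call; a real agent would pick the tool and
# arguments from the schema returned by the server's tools/list method.
msg = make_tool_call(1, "create_calendar_event", {"title": "Team sync"})
print(msg)
```

Exposing all 604 tools through one uniform protocol like this is what lets the benchmark swap applications in and out of its containerized environments without changing the agent-side interface.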
TOOLATHLON is now publicly available, with the goal of fostering the development of more advanced and practical language agents that can effectively navigate the complexities of real-world applications.