Tiny Orchestrator LLM Slashes Costs, Outperforms GPT-5 on Complex Reasoning Tasks
A new architecture for artificial intelligence agents, dubbed ToolOrchestra, has demonstrated that smaller, strategically trained models can surpass the performance of frontier large language models (LLMs) while dramatically cutting computational costs.
Introduced by NVIDIA researchers, ToolOrchestra is a novel method for training a lightweight LLM—the 8-billion-parameter “Orchestrator”—to act as a strategic central intelligence. This Orchestrator manages a diverse toolkit, which crucially includes powerful external resources, specialized LLMs, and even generalist models like GPT-5, delegating sub-tasks only when necessary.
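To make the division of labor concrete, here is a minimal sketch of what such a heterogeneous toolkit could look like in code; the tool names, categories, and per-call costs below are illustrative assumptions, not figures from the paper.

```python
from dataclasses import dataclass

@dataclass
class Tool:
    name: str             # identifier the Orchestrator emits when it delegates
    kind: str             # "api", "code", or "llm"
    cost_per_call: float  # rough dollar cost, feeding the efficiency objective

# Hypothetical toolkit mixing cheap utilities, specialist LLMs, and a frontier model.
TOOLKIT = [
    Tool("web_search", "api", 0.001),
    Tool("code_interpreter", "code", 0.0005),
    Tool("math_specialist_llm", "llm", 0.01),
    Tool("gpt5_generalist", "llm", 0.05),
]
```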
The research directly challenges the prevailing assumption that complex agentic tasks require ever-larger, monolithic models.
Overcoming Monolithic Inefficiency
Current high-performance LLMs, while capable of tool use (such as calling a web search API or a code interpreter), often struggle with complex, multi-step problems like those found in advanced research or mathematical proofs. Furthermore, when these models are simply prompted to manage an expanded toolkit (including other, specialized LLMs), they exhibit rigid and costly biases.
For example, experiments showed that off-the-shelf LLMs acting as orchestrators fail to be strategic. The Qwen3-8B model, when prompted to use tools, disproportionately deferred the task to the expensive GPT-5 nearly three-quarters of the time, regardless of cost. Similarly, GPT-5 acting as an orchestrator showed a strong bias toward its own cheaper variant, GPT-5-mini, even when specialized tools might have been better.
ToolOrchestra overcomes this by training the Orchestrator end-to-end using Reinforcement Learning (RL) guided by a multi-objective reward system. This reward system simultaneously optimizes for three factors: the correctness of the final outcome, resource efficiency (minimizing cost and latency), and adherence to user preferences (e.g., favoring open-source models).
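A minimal sketch of such a multi-objective reward might look as follows; the specific terms, weights, and scaling are illustrative assumptions, not the paper's formula.

```python
def orchestration_reward(outcome_correct: bool,
                         total_cost: float,            # summed dollar cost of all tool calls
                         total_latency: float,         # summed wall-clock seconds
                         preference_adherence: float,  # 0..1, e.g. share of open-source tool calls
                         w_cost: float = 0.3,
                         w_latency: float = 0.1,
                         w_pref: float = 0.2) -> float:
    """Combine correctness, efficiency, and user-preference terms into one RL reward."""
    quality = 1.0 if outcome_correct else 0.0
    # Cost and latency enter as penalties; preference adherence is a bonus.
    return quality - w_cost * total_cost - w_latency * total_latency + w_pref * preference_adherence
```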
This strategic training enables the small Orchestrator model to dynamically decide which tool offers the best performance-cost trade-off for any given step in a multi-turn reasoning process.
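As a toy illustration of that per-step trade-off (reusing the hypothetical Tool registry above), a routing rule might score each candidate tool by expected quality minus a cost penalty; the hand-supplied quality estimates here are a stand-in for the judgment the trained Orchestrator learns to make.

```python
def pick_tool(expected_quality: dict, toolkit, cost_weight: float = 5.0):
    """Pick the tool with the best quality-minus-cost score for the current step.

    `expected_quality` maps tool names to a 0..1 estimate of how likely that
    tool is to advance the task; in ToolOrchestra this judgment is learned,
    whereas here it is supplied by hand for illustration.
    """
    return max(toolkit, key=lambda t: expected_quality.get(t.name, 0.0)
                                      - cost_weight * t.cost_per_call)

# Example: on an easy lookup step, cheap web search beats GPT-5 once cost is weighed in.
# pick_tool({"web_search": 0.8, "gpt5_generalist": 0.9}, TOOLKIT)  # -> web_search
```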
State-of-the-Art Performance at a Fraction of the Price
The effectiveness of this orchestration strategy was validated across three challenging reasoning benchmarks, including “Humanity’s Last Exam” (HLE), which consists of complex, PhD-level questions spanning scientific and humanities disciplines.
The 8B Orchestrator achieved an accuracy score of 37.1% on HLE, significantly outperforming a tool-using GPT-5 baseline, which scored 35.1%.
The efficiency gains were even more pronounced. The Orchestrator was 2.5 times more cost-effective on HLE than GPT-5. On other benchmarks such as FRAMES (a factuality reasoning test) and τ²-Bench (a multi-turn tool-calling conversation benchmark), the Orchestrator achieved better overall performance while using only about 30% of the computational cost of leading monolithic models.
These results suggest that a composite system, efficiently managed by a small, specialized orchestrator, can outperform a single, massive LLM. The method paves the way for practical, highly scalable, and user-controllable tool-augmented AI reasoning systems.