New Algorithm ARPO Boosts LLM Reasoning with Tool Use and Entropy
Large language models (LLMs) are showing impressive capabilities in complex reasoning tasks, especially when they can leverage external tools like search engines or code interpreters. However, training LLMs to handle multi-turn interactions with such tools effectively has remained a challenge. A new reinforcement learning approach called Agentic Reinforced Policy Optimization (ARPO) aims to bridge this gap by intelligently exploring how LLMs use tools, leading to significant performance gains with reduced computational cost.
The core innovation of ARPO lies in its ability to dynamically adapt its exploration strategy based on the LLM’s behavior after using a tool. Researchers observed that when LLMs interact with external tools, they often exhibit increased uncertainty, indicated by higher entropy in the distribution of generated tokens. This uncertainty suggests that the LLM is exploring new reasoning paths but may not be fully capitalizing on them.
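As a rough illustration of the signal involved, token-level entropy can be computed directly from the model’s logits. The sketch below (in PyTorch) shows one way to flag an entropy spike in the tokens generated right after a tool response; the window size, threshold, and function names are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy (in nats) of the next-token distribution at each position.

    logits: (seq_len, vocab_size) raw model outputs.
    Returns a (seq_len,) tensor of per-token entropies.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    return -(probs * log_probs).sum(dim=-1)

def entropy_spike_after_tool(logits: torch.Tensor,
                             tool_response_end: int,
                             window: int = 8,
                             threshold: float = 2.0) -> bool:
    """Heuristic check: is the average entropy of the first few tokens
    generated after the tool response unusually high?"""
    post = token_entropy(logits[tool_response_end:tool_response_end + window])
    return bool(post.numel() > 0 and post.mean() > threshold)
```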
ARPO tackles this by implementing an entropy-based adaptive rollout mechanism. Instead of uniformly sampling all possible tool-use sequences, ARPO first samples globally and then, when it detects high token entropy after a tool call, it “branches out” to explore more specific, potentially more informative, tool-use trajectories. This targeted exploration helps the LLM discover more effective ways to use tools in multi-turn reasoning. For instance, if an LLM is asked to find information about a historical event and uses a search engine, ARPO might observe a spike in the LLM’s uncertainty about the next step. Instead of just continuing with the most likely next token, ARPO would then encourage the LLM to explore alternative search queries or ways to process the search results.
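The sketch below outlines what such an entropy-gated branching rollout could look like. The `policy.rollout`, `policy.rollout_from`, and `policy.follows_tool_response` calls, along with all hyperparameters, are hypothetical placeholders for whatever agent/rollout interface a training framework provides, not the paper’s actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    tokens: list = field(default_factory=list)     # generated token ids
    entropies: list = field(default_factory=list)  # per-token entropy values

def adaptive_rollout(policy, prompt, n_global=4, n_branch=2,
                     entropy_threshold=2.0, max_steps=64):
    """Sample a few full rollouts globally, then spawn extra partial rollouts
    ("branches") at positions where entropy spikes after a tool response."""
    trajectories = [policy.rollout(prompt, max_steps) for _ in range(n_global)]

    branched = []
    for traj in trajectories:
        for step, entropy in enumerate(traj.entropies):
            # Branch only at tokens that immediately follow a tool response
            # and whose entropy exceeds the threshold.
            if policy.follows_tool_response(traj, step) and entropy > entropy_threshold:
                prefix = traj.tokens[:step]
                branched += [policy.rollout_from(prefix, max_steps)
                             for _ in range(n_branch)]
    return trajectories + branched
```

The key design idea is that extra sampling budget is spent only where the model is visibly uncertain, rather than uniformly across every trajectory.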
To further refine this process, ARPO also incorporates advantage attribution estimation. This mechanism helps the LLM understand which specific tool-use steps contributed most to a successful outcome. By assigning different “advantages” (estimates of how much each step contributed to the final outcome) to different parts of the reasoning process, the LLM can learn to internalize these step-level improvements more effectively.
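One plausible way to realize this kind of step-level credit assignment is to give tokens in a shared prefix the average advantage of the branches built on top of it, while tokens unique to each branch keep that branch’s own advantage. The sketch below assumes a GRPO-style normalized baseline over the sampled group; the paper’s exact attribution rule may differ, and all names and arguments here are illustrative.

```python
import numpy as np

def attribute_advantages(branch_rewards, group_mean, group_std,
                         prefix_len, branch_lens):
    """branch_rewards: outcome rewards of the rollouts that share one prefix.
    group_mean / group_std: reward statistics over the full sampled group,
    used as a normalized baseline.
    prefix_len: number of tokens shared by all branches.
    branch_lens: number of tokens unique to each branch.
    Returns one per-token advantage array per branch."""
    adv = (np.asarray(branch_rewards, dtype=float) - group_mean) / (group_std + 1e-8)
    shared_adv = adv.mean()  # shared prefix: average advantage of its branches
    per_branch = []
    for a, blen in zip(adv, branch_lens):
        per_branch.append(np.concatenate([np.full(prefix_len, shared_adv),
                                          np.full(blen, a)]))
    return per_branch
```

Under this scheme, a tool-use step that appears only in the winning branch is credited with that branch’s full advantage, while steps common to all branches receive a blended signal.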
The effectiveness of ARPO was demonstrated across 13 challenging benchmarks spanning computational reasoning, knowledge reasoning, and deep search domains. The results show that ARPO consistently outperforms existing trajectory-level reinforcement learning algorithms. Crucially, ARPO achieves this improved performance while using only half the tool-use budget, making it a more efficient and scalable solution for training sophisticated LLM agents. For example, in deep search tasks, ARPO helped smaller LLMs achieve performance comparable to much larger models, showcasing its ability to maximize learning from limited resources. This research suggests a promising direction for building more capable and efficient AI agents that can effectively interact with the real world through tool use.