ReTool: AI Learns to Ace Math Olympiads by Strategically Using Code
A team at ByteDance Seed has unveiled “ReTool,” a novel reinforcement learning (RL) framework that significantly enhances the mathematical reasoning abilities of large language models (LLMs) by strategically integrating code execution. The research, outlined in a paper published on April 15, 2025, addresses a critical limitation of current AI reasoning models: they struggle with structured problems that demand precise computation, such as those found in math Olympiads.
The core idea behind ReTool is to enable LLMs to dynamically interleave natural language reasoning with real-time code execution, leveraging the strengths of both approaches. Imagine an LLM tackling a complex geometry problem. Instead of relying solely on textual reasoning, it can now generate Python code snippets to perform precise calculations or geometric constructions using a code interpreter. The output of this code is then fed back into the LLM, informing its subsequent reasoning steps.
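To make this loop concrete, here is a minimal sketch of the reason-execute-feed-back cycle. The `<code>`/`<interpreter>` delimiters and the `generate` callable are illustrative assumptions, not the paper's exact tokens or inference API:

```python
import re
import subprocess
import sys

CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)

def run_python(snippet: str) -> str:
    """Execute a generated snippet in a separate Python process and capture its output."""
    result = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True, text=True, timeout=10,
    )
    return result.stdout + result.stderr

def solve(prompt: str, generate, max_rounds: int = 8) -> str:
    """Interleave natural-language reasoning with real-time code execution.

    `generate(text)` is a hypothetical stand-in for the LLM: it extends the
    transcript and stops either after a closing </code> tag or at a final answer.
    """
    transcript = prompt
    for _ in range(max_rounds):
        transcript = generate(transcript)            # model reasons in natural language
        if not transcript.rstrip().endswith("</code>"):
            return transcript                        # no code requested: final answer reached
        snippet = CODE_RE.findall(transcript)[-1]    # most recent code block
        output = run_python(snippet)
        # Feed the interpreter's output back so it informs the next reasoning step.
        transcript += f"\n<interpreter>\n{output}</interpreter>\n"
    return transcript
```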
ReTool’s success lies in its two-pronged approach:
- Cold-Start Data Generation: The researchers developed a pipeline to create a high-quality dataset demonstrating when and how to invoke a code interpreter within long-form reasoning. This provides the LLM with an initial understanding of tool use. Think of it as providing the AI with solved examples demonstrating effective use of a calculator.
- Reinforcement Learning with Outcome Feedback: The LLM is then trained with reinforcement learning, where the reward is based solely on the outcome of the problem-solving process. This encourages the model to autonomously discover optimal patterns for using the code interpreter, similar to how a human learns to use tools more effectively over time through trial and error (a minimal sketch of such a reward follows this list).
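The reward itself can be very simple. Below is a sketch of an outcome-based reward, assuming a binary correct/incorrect signal; the function name and the exact-match check are illustrative, since robust grading of math answers typically needs an expression-equivalence checker rather than plain string comparison:

```python
def outcome_reward(predicted_answer: str, ground_truth: str) -> float:
    """Outcome-only reward: score the final answer, not the intermediate steps.

    Because no per-step supervision is given on *how* the interpreter was used,
    the policy must discover effective tool-invocation patterns on its own.
    """
    return 1.0 if predicted_answer.strip() == ground_truth.strip() else -1.0
```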
Experiments on the challenging AIME benchmark (the American Invitational Mathematics Examination) showed remarkable results. A 32-billion-parameter ReTool model achieved 67% accuracy after only 400 training steps, and in an extended evaluation it attained 72.5% accuracy, surpassing OpenAI’s o1-preview model by 27.9 percentage points. This significant performance boost demonstrates the effectiveness of integrating code execution into the reasoning process.
The researchers also uncovered intriguing emergent behaviors during training, such as the LLM developing a “code self-correction” capability: the model could identify errors in its generated code and autonomously revise it. Their analysis shows that the model learned to invoke tools at appropriate moments, select them adaptively, and structure its tool calls effectively.
For example, consider a problem that requires finding the number of values for which a greedy algorithm succeeds. A standard LLM might perform laborious and error-prone text-based calculations, while ReTool can generate code to test a range of values rapidly, leading to a more accurate and efficient solution.
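To illustrate, here is the kind of script ReTool might emit for such a problem. The specific task below, counting the values of N from 1 to 1000 for which the greedy coin algorithm (with 1-, 10-, and 25-cent coins) uses the fewest possible coins, is an assumed stand-in taken from AIME 2024, since the article does not name the exact problem:

```python
def greedy_coins(n: int) -> int:
    """Number of coins the greedy algorithm uses for n cents (25, then 10, then 1)."""
    coins = 0
    for denom in (25, 10, 1):
        coins += n // denom
        n %= denom
    return coins

def optimal_coins(n: int) -> int:
    """Minimum number of coins for n cents, found by brute force."""
    best = n  # all pennies is always feasible
    for quarters in range(n // 25 + 1):
        for dimes in range((n - 25 * quarters) // 10 + 1):
            pennies = n - 25 * quarters - 10 * dimes
            best = min(best, quarters + dimes + pennies)
    return best

# Count the values of N in [1, 1000] where greedy is already optimal.
successes = sum(1 for n in range(1, 1001) if greedy_coins(n) == optimal_coins(n))
print(successes)  # expected output for this particular problem: 610
```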
The ReTool framework demonstrates the promise of outcome-driven tool integration for advancing complex mathematical reasoning in LLMs and suggests new avenues for exploring hybrid neuro-symbolic systems. The project page for ReTool is available at https://retool-rl.github.io/.