SimpleTIR: Taming Instability in Large Language Model Reasoning with Tool Interaction
Large Language Models (LLMs) have shown promise in tackling complex reasoning tasks by interacting with external tools, a paradigm known as Tool-Integrated Reasoning (TIR). However, training LLMs for multi-turn TIR using reinforcement learning (RL) has been plagued by instability and performance collapse. A new paper, “SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning,” introduces a novel approach to stabilize this training process and unlock more sophisticated reasoning capabilities.
The core problem, according to the researchers, lies in the distributional shift introduced by external tool feedback. When an LLM receives feedback from tools such as Python interpreters or search engines, the returned text can deviate sharply from the distribution of text the model was trained on. This out-of-distribution input pushes the LLM toward low-probability tokens, which, when fed back into subsequent turns, exacerbate the shift further. The compounding effect can culminate in catastrophic gradient norm explosions that derail training.
To combat this, SimpleTIR proposes a straightforward yet effective solution: identifying and filtering out “void turns.” A void turn is defined as a turn where the LLM’s response fails to produce either a complete code block or a final answer. These often manifest as incomplete code, repetitive text, or premature generation of an end-of-sequence token. By simply excluding trajectories containing these void turns from the policy update, SimpleTIR effectively prunes the problematic samples that cause gradient explosions.
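To make the filtering step concrete, here is a minimal sketch of how such a check could look. It is an illustration rather than the paper's implementation: the code-block and answer markers, and the trajectory layout, are assumptions made for the example.

```python
# Sketch of void-turn detection and trajectory filtering (illustrative only).
CODE_FENCE = "`" * 3  # assumed markdown-style fence used by the prompt template


def is_void_turn(response: str) -> bool:
    """A turn is 'void' if it contains neither a complete code block
    nor a final answer (the markers here are assumptions)."""
    has_code_block = response.count(CODE_FENCE) >= 2   # opening and closing fence
    has_final_answer = "\\boxed{" in response          # e.g. a LaTeX-style boxed answer
    return not (has_code_block or has_final_answer)


def filter_trajectories(trajectories: list[dict]) -> list[dict]:
    """Drop every multi-turn trajectory containing at least one void turn
    before it reaches the policy update."""
    return [
        traj for traj in trajectories
        if not any(is_void_turn(turn) for turn in traj["model_turns"])
    ]
```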
The researchers demonstrated SimpleTIR’s effectiveness on challenging math reasoning benchmarks. When starting with the Qwen2.5-7B base model, which initially achieved a score of 22.1 on the AIME24 benchmark with text-only reasoning, SimpleTIR boosted this score to an impressive 50.5. This significant improvement highlights the power of stable multi-turn TIR.
Beyond stabilization, SimpleTIR also fosters the emergence of diverse and sophisticated reasoning patterns. Unlike methods that rely on "cold-start" supervised fine-tuning, which can impose rigid reasoning structures, SimpleTIR runs reinforcement learning directly from the base model, encouraging it to discover novel strategies on its own. These include self-correction, where the model identifies and rectifies its own errors, and cross-validation, where it uses multiple approaches to confirm a result. For instance, on a mathematical problem, a SimpleTIR-trained model might first generate code to solve a step, then use that result to inform a second, slightly different code snippet for verification, ultimately arriving at a more robust solution.
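As an illustration of the cross-validation pattern, the following hypothetical two-turn tool interaction solves a small equation symbolically and then confirms the result with an independent numerical check; the problem and code are invented for this example, not drawn from the paper.

```python
from sympy import symbols, solve

# Turn 1: solve the step symbolically.
x = symbols("x")
roots = solve(x**2 - 5*x + 6, x)
print(roots)  # [2, 3]

# Turn 2: cross-validate with an independent numerical check.
confirmed = [r for r in range(-10, 11) if r**2 - 5*r + 6 == 0]
print(confirmed)  # [2, 3] -- agrees with the symbolic result
```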
The paper details the underlying technical mechanisms, including a hierarchical Markov decision process formulation for multi-turn TIR and a feedback token masking strategy to ensure correct credit assignment. The proposed trajectory filtering is designed to be plug-and-play, requiring minimal modifications to existing RL frameworks.
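The feedback-masking idea can be sketched as follows: during rollout, each token is tagged by whether the model generated it or a tool injected it, and only model-generated tokens contribute to the policy-gradient loss. This is a minimal sketch under that assumption, not the paper's exact implementation.

```python
import torch


def feedback_token_mask(token_sources: list[str]) -> torch.Tensor:
    """Return a 0/1 mask over a flattened multi-turn token sequence.

    `token_sources` labels each token as "model" (sampled by the LLM) or
    "tool" (tool output appended to the context). Tool tokens are masked
    out of the policy-gradient loss so the model is only credited for
    tokens it actually generated.
    """
    return torch.tensor(
        [1.0 if src == "model" else 0.0 for src in token_sources]
    )


# Usage inside a policy-gradient update (per_token_loss: shape [seq_len]):
#   mask = feedback_token_mask(sources)
#   loss = (per_token_loss * mask).sum() / mask.sum().clamp(min=1.0)
```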
In conclusion, SimpleTIR offers a promising pathway for developing more capable and reliable LLM agents for complex, multi-turn reasoning tasks, particularly in domains requiring precise calculations and information retrieval through tool interaction. By effectively addressing training instability, it paves the way for LLMs to exhibit more advanced problem-solving skills and a richer repertoire of reasoning strategies.