
A Practical Two-Stage Recipe for Mathematical LLMs: Maximizing Accuracy and Efficiency

A recent paper, “A Practical Two-Stage Recipe for Mathematical LLMs: Maximizing Accuracy with SFT and Efficiency with Reinforcement Learning,” introduces a novel training methodology designed to significantly improve the performance and efficiency of large language models (LLMs) in mathematical reasoning tasks. The research, presented by Yoshihara, Yamaguchi, and Inoue, proposes a synergistic approach that combines Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to create LLMs that are both highly accurate and computationally efficient.

The core of the proposed “recipe” is a two-stage process. The first stage maximizes accuracy through an extended period of Supervised Fine-Tuning (SFT). Contrary to common practice, the researchers found that extending SFT for as many as 10 epochs, even though performance dips slightly at first, eventually leads to significant breakthroughs in problem-solving accuracy. In effect, this prolonged SFT phase pushes the model’s reasoning capability to its peak.
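
Since the authors’ framework has not yet been released, the configuration below is only a rough sketch of what such an extended SFT stage might look like using Hugging Face’s TRL library; the dataset name, model identifier, and hyperparameters are placeholders, not the paper’s actual values.

```python
# Illustrative sketch only: dataset, model id, and hyperparameters are placeholders.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical SFT corpus with a "text" column holding worked math solutions
# (problem statement + chain-of-thought + final answer in one string).
train_dataset = load_dataset("my-org/math-sft-corpus", split="train")

config = SFTConfig(
    output_dir="sft-math-14b",
    num_train_epochs=10,            # the key choice: train far longer than usual
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",          # one checkpoint per epoch
)

trainer = SFTTrainer(
    model="my-org/base-14b",        # placeholder for a ~14B-parameter base model
    args=config,
    train_dataset=train_dataset,
)
trainer.train()
```

Saving a checkpoint at every epoch also makes it easy to observe the dip-then-breakthrough accuracy trajectory the authors describe when SFT is pushed out to 10 epochs.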

Following this intensive SFT phase, the second stage employs Reinforcement Learning (RL), specifically Group Relative Policy Optimization (GRPO), not primarily to further boost accuracy, but to enhance the model’s efficiency. The key insight here is that GRPO excels at optimizing the length of generated solutions. By applying GRPO after the SFT phase, the models achieve dramatically shorter reasoning paths, reducing computational overhead and latency, while crucially maintaining the high accuracy established in the first stage.
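
As a rough illustration of how such a length-aware RL stage could be set up, the sketch below uses TRL’s GRPOTrainer with a hypothetical reward that grants credit only for correct final answers and mildly penalizes longer completions; the dataset, model checkpoint, reward shaping, and hyperparameters are all assumptions rather than the authors’ released configuration.

```python
# Illustrative sketch only: reward shaping, names, and hyperparameters are assumptions.
import re
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

MAX_CHARS = 8000  # rough cap for the length penalty (assumption)

def extract_boxed_answer(text: str):
    """Toy final-answer parser: returns the contents of the last \\boxed{...}."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
    return matches[-1].strip() if matches else None

def correct_and_concise(completions, answer, **kwargs):
    """Reward correctness first; among correct solutions, prefer shorter ones."""
    rewards = []
    for completion, gold in zip(completions, answer):
        text = completion if isinstance(completion, str) else completion[0]["content"]
        if extract_boxed_answer(text) == str(gold).strip():
            length_penalty = min(len(text) / MAX_CHARS, 1.0)
            rewards.append(1.0 - 0.2 * length_penalty)  # shorter correct answers score higher
        else:
            rewards.append(0.0)
    return rewards

# Hypothetical RL prompt set with "prompt" and "answer" columns; extra columns
# such as "answer" are forwarded to the reward function by GRPOTrainer.
train_dataset = load_dataset("my-org/math-rl-prompts", split="train")

config = GRPOConfig(
    output_dir="grpo-math-14b",
    num_generations=8,              # sample a group of candidates per prompt
    max_completion_length=4096,
    learning_rate=1e-6,
    per_device_train_batch_size=8,
    bf16=True,
)

trainer = GRPOTrainer(
    model="sft-math-14b",           # start from the checkpoint produced by the SFT stage
    reward_funcs=correct_and_concise,
    args=config,
    train_dataset=train_dataset,
)
trainer.train()
```

Because GRPO scores all sampled candidates in a group relative to one another, shorter correct solutions gain a small advantage over longer correct ones, which is one simple way to nudge the policy toward the concise reasoning paths the paper reports while leaving accuracy intact.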

The paper emphasizes that SFT and RL are not competing but complementary. SFT builds the foundational accuracy, while GRPO refines the model for practical use by making its responses more concise. This sequential strategy offers a clear and reproducible path to developing advanced mathematical reasoners.

To demonstrate the efficacy of their approach, the researchers conducted extensive evaluations on challenging benchmarks, including AIME 2024, AIME 2025, and the AI Mathematical Olympiad (AIMO). The AIMO benchmark is particularly noteworthy as it is designed to be strictly leak-free and highly competitive. The proposed method achieved top-tier performance, securing a high rank among over 2,200 teams in the AIMO competition. This success highlights the robustness and real-world applicability of their training recipe.

For instance, in experiments with a 14-billion parameter model, the “Original + RL” configuration showed improved token efficiency but a drop in accuracy. However, when the model first underwent 10 epochs of SFT and then RL (“+ SFT (10 epochs) + RL”), it achieved both higher accuracy and better token efficiency compared to the original model and the RL-only approach. This outcome clearly illustrates the synergistic benefit of the two-stage strategy.

The researchers will be open-sourcing their entire framework, including code, model checkpoints, and training configurations, to promote reproducibility and facilitate future research in this area. This work provides a valuable blueprint for the AI community aiming to develop more capable and efficient mathematical LLMs.