Small Language Models Master Math Reasoning Through 'Deep Thinking'
Microsoft researchers unveil rStar-Math, a novel approach that allows small language models (SLMs) to rival the mathematical reasoning capabilities of significantly larger models, even surpassing them in some cases.
For years, large language models (LLMs) have shown promise in solving mathematical problems, but their performance often falls short of human capabilities, especially when dealing with complex or nuanced questions. The conventional approach, akin to “System 1” thinking, involves generating a complete solution in a single step, which can be prone to errors. To overcome this limitation, researchers have explored “System 2” thinking, which emulates human reasoning with a multi-step process. However, System 2 approaches traditionally rely on large, computationally expensive models or extensive, manually curated datasets, making them impractical for many applications.
Microsoft Research Asia’s new method, rStar-Math, changes this. It demonstrates that smaller, more computationally efficient SLMs can achieve state-of-the-art math reasoning performance using a process they call “deep thinking.” This involves combining the strengths of Monte Carlo Tree Search (MCTS) with carefully designed training strategies, enabling SLMs to learn effective reasoning capabilities without relying on distillation from superior models.
rStar-Math introduces three key innovations:
-
Code-Augmented Chain-of-Thought (CoT) Data Synthesis: Instead of relying on existing, often incomplete or noisy datasets, rStar-Math synthesizes its own high-quality data. This involves breaking down problem-solving into multiple steps guided by MCTS, where the model generates each step alongside corresponding Python code. This process verifies the correctness of each step, ensuring high-quality data for training. For example, if the problem involves calculating the area of a triangle, the model might generate steps for calculating the base and height, with verifiable Python code for each step.
-
Novel Process Preference Model (PPM) Training: Traditional reward models in System 2 approaches often struggle with noisy data or inaccurate annotations. rStar-Math’s PPM avoids this by leveraging the Q-values generated during MCTS rollouts. Instead of directly using Q-values as rewards, the PPM learns to predict the quality of each step by comparing step Q-values across multiple trajectories. This avoids the need for manually annotated step-level scores, significantly simplifying the training process.
-
Self-Evolution Recipe: The policy SLM (which generates solutions) and the PPM are iteratively evolved. Each round involves performing extensive MCTS rollouts using the current models to generate even higher-quality reasoning trajectories. These trajectories are then used to train improved policy and reward models for the next round. This four-round self-evolution process significantly boosts the SLMs’ reasoning capabilities.
The paper demonstrates rStar-Math’s effectiveness on various challenging benchmarks, including the MATH dataset and the USA Math Olympiad (AIME) problems. For instance, it improved the performance of the Qwen2.5-Math-7B model from 58.8% accuracy on MATH to 90%, surpassing the performance of OpenAI’s o1-preview model. On the AIME 2024, it achieved an average of 53.3% (8 out of 15 problems solved), placing among the top 20% of high-school students.
Remarkably, rStar-Math achieves this performance using SLMs as small as 1.5 billion parameters, highlighting the potential of this approach for resource-constrained applications. Furthermore, the paper shows that rStar-Math exhibits an unexpected emergent capability: intrinsic self-reflection. The model can sometimes identify and correct its own errors during the reasoning process without explicit training for this behavior.
Overall, rStar-Math represents a significant advancement in the field of LLM mathematical reasoning, offering a more efficient and scalable approach to achieving near human-level performance. The approach’s generalizability to other domains, such as code and commonsense reasoning, is also suggested as an area for future research.
Chat about this paper
To chat about this paper, you'll need a free Gemini API key from Google AI Studio.
Your API key will be stored securely in your browser's local storage.