
Train Smarter, Not Harder: New Method Shrinks LLM Reasoning While Boosting Accuracy

New research introduces a “curriculum learning” approach to train Large Language Models (LLMs) to reason more efficiently, producing shorter, more precise outputs without sacrificing accuracy.

Large Language Models (LLMs) have shown remarkable progress in tackling complex reasoning tasks. However, a persistent challenge is their tendency to generate overly verbose explanations, which drives up computational costs. Traditional methods for controlling this verbosity often rely on a fixed “token budget” during training, meaning the model is constrained to the same output length at every stage of learning.

This new paper, “Train Long, Think Short: Curriculum Learning for Efficient Reasoning,” proposes a more nuanced approach: curriculum learning for length control. Inspired by how humans learn, the method gradually tightens the token budget throughout the training process. This allows LLMs to first explore and discover effective reasoning strategies with a generous budget, and then “distill” these strategies into more concise explanations as the budget shrinks.
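
To make the curriculum concrete, here is a minimal sketch of a decaying token-budget schedule; the exponential form, function name, and parameter values are illustrative assumptions rather than the paper’s exact schedule.

```python
import math

def token_budget(step: int,
                 initial_budget: int = 256,
                 final_budget: int = 64,
                 decay_rate: float = 0.01) -> int:
    """Token budget at a given training step (illustrative schedule).

    The budget starts generous so the model can explore long reasoning
    chains, then decays exponentially toward `final_budget` as training
    progresses, forcing the model to distill its reasoning.
    """
    budget = final_budget + (initial_budget - final_budget) * math.exp(-decay_rate * step)
    return max(final_budget, int(budget))

# Example: the budget shrinks as training progresses.
for step in (0, 100, 500, 1000):
    print(step, token_budget(step))
```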

The method builds on Group Relative Policy Optimization (GRPO), a reinforcement-learning algorithm, augmented with a reward system that balances three key elements (sketched in code after this list):

  • Task Correctness: Ensuring the final answer is accurate, verified by an automated system.
  • Length Efficiency: Rewarding adherence to the dynamically changing token budget.
  • Formatting Adherence: Encouraging structured reasoning and clear separation of the final answer using specific tags.
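
A minimal sketch of how these three signals might be combined into a single scalar reward for GRPO-style training is shown below; the weights and the simple shaping are assumptions for illustration, and the paper’s exact formula may differ.

```python
def combined_reward(answer_correct: bool,
                    response_tokens: int,
                    budget: int,
                    well_formatted: bool,
                    w_correct: float = 1.0,
                    w_length: float = 0.5,
                    w_format: float = 0.25) -> float:
    """Composite reward combining the three signals listed above (illustrative)."""
    # 1. Task correctness, as judged by an automated verifier.
    r_correct = 1.0 if answer_correct else 0.0

    # 2. Length efficiency: adherence to the current (decaying) token budget.
    r_length = 1.0 if response_tokens <= budget else 0.0

    # 3. Formatting adherence: reasoning and final answer in the expected tags.
    r_format = 1.0 if well_formatted else 0.0

    return w_correct * r_correct + w_length * r_length + w_format * r_format
```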

To illustrate the core idea, imagine training a student to solve a complex math problem. Initially, you might allow them to write out every single step, even the obvious ones, to ensure they understand the process. As they gain confidence, you’d gradually ask them to be more concise, focusing only on the critical steps, but still requiring a correct answer. This is the essence of the proposed curriculum.

The researchers tested their approach on several mathematical reasoning datasets, including GSM8K (grade-school arithmetic) and MATH500 (more challenging, competition-level problems). They found that their curriculum learning method consistently outperformed models trained with fixed token budgets. Even when both approaches finished training with the same tight budget, the curriculum-trained models achieved higher accuracy and were significantly more token-efficient.

For example, on the GSM8K dataset, their model achieved 86.20% accuracy using an average of 87 tokens, compared to a baseline model that used 258.4 tokens for 83.55% accuracy. This demonstrates not only better efficiency but also improved performance. Crucially, the benefits extended to “out-of-distribution” datasets, suggesting the models generalize better.

The study also highlighted the importance of the decay schedule – how quickly the token budget shrinks. They found that faster decays led to better efficiency, especially on harder tasks, while slower, more gradual decays allowed for more exploration and benefited easier datasets. The choice of how to reward length (e.g., a smooth increase in reward up to the budget, versus a flat reward) also impacted performance, with a gradual increase proving more effective at maintaining accuracy.
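
One way to read the reward-shaping comparison is as a contrast between a flat, all-or-nothing length reward and a smooth ramp toward the budget; the functional forms below are assumptions for illustration, not the paper’s equations.

```python
def length_reward_flat(tokens: int, budget: int) -> float:
    """Flat shaping: full reward for any response within the budget, zero otherwise."""
    return 1.0 if tokens <= budget else 0.0


def length_reward_ramp(tokens: int, budget: int) -> float:
    """Smooth shaping: reward increases gradually up to the budget, then drops to zero.

    This keeps a graded signal across a range of lengths instead of a single
    hard cutoff. (Illustrative form; the paper's exact ramp may differ.)
    """
    return tokens / budget if tokens <= budget else 0.0


# Example: with a 100-token budget, an 80-token answer scores
# 1.0 under the flat scheme but 0.8 under the ramp.
print(length_reward_flat(80, 100), length_reward_ramp(80, 100))
```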

In essence, this research offers a powerful and adaptable method for training LLMs to be both intelligent and economical, paving the way for more efficient and scalable AI reasoning. The team has also released their code to foster further research in this area.