Demystifying Long Chain-of-Thought Reasoning in LLMs
Large language models (LLMs) are increasingly capable of complex reasoning, thanks to techniques like chain-of-thought (CoT) prompting. However, eliciting truly long CoTs, meaning extended reasoning traces in which the model checks its work, corrects errors, and backtracks, remains a challenge. A new paper, “Demystifying Long Chain-of-Thought Reasoning in LLMs,” published on arXiv, sheds light on this process, revealing key factors influencing the length and effectiveness of CoT reasoning.
The researchers systematically investigated the mechanics of long CoT reasoning through supervised fine-tuning (SFT) and reinforcement learning (RL) experiments. Their findings offer valuable insights into optimizing LLMs for enhanced long CoT reasoning.
Four Key Findings
- The Role of Supervised Fine-Tuning (SFT): While not strictly necessary for training LLMs to generate long CoTs, SFT significantly simplifies the training process and improves efficiency. Using SFT with long CoT examples, rather than short ones, yields considerably better results and provides a stronger foundation for subsequent RL. The researchers demonstrate this with scaling curves, showing long CoT SFT leading to higher accuracy than short CoT SFT.
- Reward Shaping is Crucial: Simply increasing training compute doesn’t guarantee the emergence of long CoTs, so reward shaping is critical for stabilizing CoT length growth. The authors introduce a “cosine length-scaling reward” paired with a repetition penalty. Correct answers always earn more than wrong ones; among correct answers, shorter CoTs earn more than longer ones, while among wrong answers, shorter CoTs are penalized more heavily, which nudges the model to keep reasoning when it has not yet found a solution. The repetition penalty guards against reward hacking, where the model pads its output with repeated text to gain length instead of improving accuracy. Together, these improved the stability and accuracy of long CoT generation compared to a simpler reward based solely on correctness.
- Scaling Verifiable Reward Signals: The availability of high-quality, verifiable data is a major bottleneck. The researchers demonstrate that noisy, web-extracted solutions, with appropriate filtering, are a promising source of training data. For example, they show that incorporating noisy WebInstruct data with a filtering mechanism significantly improves the model’s performance, especially on out-of-distribution (OOD) benchmarks.
- Core Abilities are Inherent: Core reasoning skills like error correction and backtracking are already present in base models. However, effectively incentivizing these abilities for complex tasks through RL requires significant compute and a nuanced approach to measuring their emergence. The researchers found that even with a strong starting model, directly applying RL without careful reward shaping and optimization is not sufficient.
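The reward shaping from the second finding can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the default reward values, the n-gram size, and the per-token penalty are all assumptions chosen for readability. The sketch assumes that correct answers earn more when they are short, that wrong answers are penalized more when they are short, and that hitting the length limit incurs a fixed penalty.

```python
import math


def cosine_interp(t, T, value_at_0, value_at_T):
    """Cosine-shaped interpolation from value_at_0 (at t=0) to value_at_T (at t=T)."""
    return value_at_T + 0.5 * (value_at_0 - value_at_T) * (1.0 + math.cos(math.pi * t / T))


def cosine_length_reward(is_correct, gen_len, max_len,
                         r_correct_short=2.0, r_correct_long=1.0,
                         r_wrong_short=-10.0, r_wrong_long=0.0,
                         r_exceed=-10.0):
    """Length-aware outcome reward (all default values are illustrative assumptions)."""
    if gen_len >= max_len:
        # The output ran into the context limit without finishing.
        return r_exceed
    if is_correct:
        # Correct: reward decays smoothly as the CoT gets longer.
        return cosine_interp(gen_len, max_len, r_correct_short, r_correct_long)
    # Wrong: the penalty shrinks as the CoT gets longer, encouraging more thinking.
    return cosine_interp(gen_len, max_len, r_wrong_short, r_wrong_long)


def ngram_repetition_penalty(tokens, n=4, penalty=-0.05):
    """Per-token penalties: every token inside an already-seen n-gram gets `penalty`."""
    seen = set()
    penalties = [0.0] * len(tokens)
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            for j in range(i, i + n):
                penalties[j] = penalty
        else:
            seen.add(gram)
    return penalties
```

In an RL loop, the scalar from `cosine_length_reward` would typically be attached to the final token of a sampled response, and the list from `ngram_repetition_penalty` added token by token, so that repeated spans are discouraged exactly where they occur. How the two signals are combined and weighted is a design choice this sketch leaves open.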
Practical Implications
This research provides several practical guidelines for researchers and developers:
- Prioritize long CoT SFT for better RL initialization.
- Use the proposed cosine length-scaling reward to stabilize and improve long CoT generation.
- Leverage noisy data sources like WebInstruct to overcome the limitations of high-quality, human-annotated datasets.
- Design RL incentives carefully, acknowledging the inherent abilities in base models and measuring their emergence using a nuanced approach.
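To make the noisy-data guideline concrete, here is one plausible shape for a filtering step, written as a hedged sketch. It assumes each example carries a `prompt` and a `reference_answer` extracted from the web solution, and that a hypothetical callable `sample_answers` returns several model-generated final answers for a prompt. A prompt is kept only if at least `min_matches` sampled answers agree with the reference. The field names, the normalization rule, and the threshold are illustrative assumptions, not details confirmed by the summary above.

```python
def normalize_answer(answer):
    """Crude normalization so trivially different strings still compare equal."""
    return answer.strip().lower().rstrip(".")


def filter_noisy_prompts(examples, sample_answers, min_matches=1):
    """Keep examples whose reference answer is reproduced by sampled model answers.

    `sample_answers` is a hypothetical callable: prompt -> list of answer strings.
    """
    kept = []
    for example in examples:
        reference = example.get("reference_answer")
        if not reference:
            # No extractable answer means no verifiable reward signal.
            continue
        target = normalize_answer(reference)
        candidates = sample_answers(example["prompt"])
        matches = sum(1 for c in candidates if normalize_answer(c) == target)
        if matches >= min_matches:
            kept.append(example)
    return kept
```

The idea behind this kind of filter is that a reference answer no sampled response can reproduce is more likely to be mis-extracted or wrong, so dropping it reduces noise in the reward signal at the cost of also discarding some genuinely hard problems.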
The paper’s findings represent a significant step towards a better understanding of long CoT reasoning in LLMs. By addressing challenges related to training data, reward design, and the inherent capabilities of base models, this work paves the way for more powerful and reliable AI systems capable of sophisticated reasoning.