
DeepSeek-R1: A Reinforcement Learning Approach to Boosting Reasoning in LLMs

Large language models (LLMs) have made significant strides in recent years, but their reasoning capabilities remain a challenge. A new paper introduces DeepSeek-R1, a novel approach that leverages reinforcement learning (RL) to significantly improve the reasoning abilities of LLMs, even without extensive supervised fine-tuning.

The researchers from DeepSeek-AI began by training DeepSeek-R1-Zero. This model underwent large-scale RL training without any prior supervised fine-tuning (SFT). This is a significant departure from most existing methods, which typically rely on SFT as a crucial preliminary step. Remarkably, DeepSeek-R1-Zero developed impressive reasoning behaviors, including self-verification and generating complex chains of thought (CoT) – sequences of reasoning steps leading to a final answer. However, its output suffered from poor readability and language mixing.
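
The RL signal for DeepSeek-R1-Zero comes largely from simple rule-based rewards rather than a learned reward model: an accuracy reward for a verifiably correct final answer, plus a format reward for keeping the reasoning inside designated tags. A minimal Python sketch of that idea follows; the function names, the boxed-answer convention, and the 0.1 weighting are illustrative assumptions, not the authors' exact implementation.

```python
import re

# Illustrative rule-based reward in the spirit of DeepSeek-R1-Zero's training
# signal: reward a verifiably correct final answer plus a well-formed reasoning
# block. Names, patterns, and the 0.1 weighting are hypothetical.

THINK_PATTERN = re.compile(r"<think>.*?</think>", re.DOTALL)
BOXED_PATTERN = re.compile(r"\\boxed\{([^}]*)\}")

def format_reward(completion: str) -> float:
    """1.0 if the completion wraps its reasoning in <think>...</think> tags."""
    return 1.0 if THINK_PATTERN.search(completion) else 0.0

def accuracy_reward(completion: str, reference_answer: str) -> float:
    """1.0 if the final boxed answer matches the reference exactly."""
    matches = BOXED_PATTERN.findall(completion)
    return 1.0 if matches and matches[-1].strip() == reference_answer.strip() else 0.0

def total_reward(completion: str, reference_answer: str) -> float:
    # A simple weighted sum; the paper's actual combination may differ.
    return accuracy_reward(completion, reference_answer) + 0.1 * format_reward(completion)
```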

To address these shortcomings, the researchers introduced DeepSeek-R1. This model incorporated several key improvements:

  1. Cold Start Data: DeepSeek-R1 began with a small amount of “cold start” data—carefully curated examples of reasoning processes—to fine-tune a base model. This provided a better starting point for the RL training, improving the quality of the generated reasoning.
  2. Multi-Stage Training: The training process for DeepSeek-R1 was divided into multiple stages. After the initial fine-tuning, reasoning-oriented RL was applied, focusing on tasks that require rigorous logical deduction. Next, rejection sampling was used to create additional SFT data, so that the model learned only from successful reasoning examples while unsuccessful ones were discarded (a sketch of this step follows the list). This expanded dataset was used to retrain the model, followed by a final stage of RL to refine performance across a broader range of scenarios.
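
The rejection-sampling step lends itself to a short illustration. The sketch below is a simplification of the general technique rather than the paper's exact pipeline: sample several candidate completions per prompt, keep only those whose final answer verifies against a reference, and collect the survivors as new SFT examples. The `generate_fn` callable and the `is_correct` check are hypothetical stand-ins.

```python
def is_correct(completion: str, reference_answer: str) -> bool:
    """Toy answer check: exact match on the text after a final 'Answer:' marker.
    (A hypothetical stand-in for the paper's actual verification rules.)"""
    _, _, answer = completion.rpartition("Answer:")
    return answer.strip() == reference_answer.strip()

def build_sft_data(generate_fn, prompts, references, num_samples=16):
    """Rejection sampling, simplified: sample several completions per prompt
    and keep only those whose final answer verifies against the reference."""
    sft_examples = []
    for prompt, reference in zip(prompts, references):
        candidates = [generate_fn(prompt) for _ in range(num_samples)]
        accepted = [c for c in candidates if is_correct(c, reference)]
        if accepted:
            # Keep the shortest accepted completion to favour concise reasoning;
            # the paper's actual filtering criteria are more involved.
            sft_examples.append({"prompt": prompt, "completion": min(accepted, key=len)})
    return sft_examples
```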

The results were compelling. DeepSeek-R1 achieved performance comparable to OpenAI’s highly regarded o1-1217 model on several benchmarks. For example:

  • AIME 2024: DeepSeek-R1 achieved a Pass@1 score of 79.8%, slightly surpassing OpenAI o1-1217.
  • MATH-500: DeepSeek-R1 scored an impressive 97.3% Pass@1, on par with OpenAI o1-1217.
  • Codeforces: DeepSeek-R1 achieved a percentile of 96.3%, exceeding 96% of human participants in coding competitions.

The paper also explored distilling the knowledge learned by DeepSeek-R1 into smaller, more efficient models. Using Qwen and Llama base models, the researchers created distilled versions of DeepSeek-R1 that significantly outperform other open-source models on several benchmarks. For instance, the 14B-parameter distilled model outperformed the state-of-the-art open-source QwQ-32B-Preview.
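
The distillation described in the paper is essentially supervised fine-tuning: the smaller Qwen and Llama models are trained on reasoning traces generated by DeepSeek-R1, with no RL applied to the students. A minimal PyTorch sketch of the masked next-token loss such a setup might use is shown below; the `distillation_sft_loss` helper and its tensor layout are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def distillation_sft_loss(student_logits, token_ids, prompt_lengths):
    """Standard next-token cross-entropy, masked so that only the teacher-written
    completion tokens (not the prompt) contribute to the loss.

    student_logits: (batch, seq_len, vocab) logits from the smaller student model.
    token_ids:      (batch, seq_len) prompt + teacher-completion token ids.
    prompt_lengths: (batch,) number of prompt tokens per example.
    """
    # Shift so that position t predicts token t+1.
    logits = student_logits[:, :-1, :]
    targets = token_ids[:, 1:]

    # Supervise only completion tokens: target at position t is token t+1,
    # which belongs to the completion when t >= prompt_length - 1.
    positions = torch.arange(targets.size(1), device=token_ids.device).unsqueeze(0)
    completion_mask = positions >= (prompt_lengths.unsqueeze(1) - 1)

    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)

    return (loss * completion_mask).sum() / completion_mask.sum().clamp(min=1)
```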

The study highlights several key contributions. The use of pure RL to develop reasoning capabilities in LLMs without SFT, as shown in DeepSeek-R1-Zero, opens a new avenue for research. The multi-stage training process of DeepSeek-R1, incorporating cold-start data and rejection sampling, proved remarkably effective at improving both reasoning performance and readability. Finally, the successful distillation to smaller models shows a practical path towards making advanced reasoning capabilities more widely accessible. While some challenges remain, such as language mixing and sensitivity to prompting, DeepSeek-R1 represents a significant step forward in improving the reasoning abilities of LLMs. The open-sourcing of DeepSeek-R1 and the associated smaller models allows the research community to build on these advances.