RLEP: Replay Your Way to Smarter Language Models
Large language models (LLMs) are becoming increasingly adept at complex reasoning tasks, but training them with reinforcement learning (RL) can be a challenging and energy-intensive process. Models can become unstable, and their performance may plateau or even degrade over time. Now, researchers from Kuaishou Technology have introduced a novel framework called Reinforcement Learning with Experience Replay (RLEP) that tackles these issues by leveraging past successes to accelerate learning and improve final performance.
The core idea behind RLEP, as described in their paper, is to create a “memory” of successful reasoning paths and then strategically “replay” these successes during training. This is a two-phase approach:
1. Experience Collection: In the first phase, a standard RL training run is conducted with a “seed” LLM. During this phase, the model generates multiple reasoning attempts (called “trajectories”) for various problems. The key here is that only the trajectories that lead to a verified correct answer are kept and stored in an “experience pool.” This pool essentially becomes a curated collection of high-quality solutions.
2. Replay-Based Training: In the second phase, the model undergoes further training, but with a twist. At each training step, the model generates new reasoning trajectories as usual. It then mixes these fresh attempts with a small batch of successful trajectories sampled at random from the previously collected experience pool. This blend of "fresh" exploration and "replayed" successes lets the model reinforce proven reasoning strategies while still discovering new ones (a rough sketch of both phases follows below).
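To make the data flow concrete, here is a minimal, self-contained Python sketch of the two phases. It is illustrative only: Trajectory, generate_trajectories, noisy_policy, and the answer check are toy placeholders for the LLM sampler and the rule-based verifier, the actual policy-gradient update is left out, and group sizes such as fresh_n and replay_n are arbitrary values chosen for the sketch rather than the paper's hyperparameters.

```python
import random
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Toy stand-in for a sampled reasoning trace; only the final answer matters here."""
    problem_id: int
    answer: int


def generate_trajectories(policy, problem_id, n):
    # Placeholder for LLM sampling: `policy` maps a problem id to a candidate answer.
    return [Trajectory(problem_id, policy(problem_id)) for _ in range(n)]


def is_correct(traj, ground_truth):
    # Placeholder for the rule-based answer verifier.
    return traj.answer == ground_truth[traj.problem_id]


def collect_experience(policy, problem_ids, ground_truth, rollouts=8):
    """Phase 1: run the seed policy and keep only verified-correct trajectories."""
    pool = {}
    for pid in problem_ids:
        hits = [t for t in generate_trajectories(policy, pid, rollouts)
                if is_correct(t, ground_truth)]
        if hits:
            pool[pid] = hits
    return pool


def replay_step(policy, pid, pool, ground_truth, fresh_n=6, replay_n=2):
    """Phase 2: mix fresh rollouts with replayed successes to form one update group."""
    fresh = generate_trajectories(policy, pid, fresh_n)
    stored = pool.get(pid, [])
    replay = random.sample(stored, min(replay_n, len(stored)))
    group = fresh + replay
    rewards = [1.0 if is_correct(t, ground_truth) else 0.0 for t in group]
    return group, rewards  # in practice this group feeds a policy-gradient update


if __name__ == "__main__":
    ground_truth = {0: 42}

    def noisy_policy(pid):
        return random.choice([41, 42, 43])  # correct about a third of the time

    pool = collect_experience(noisy_policy, [0], ground_truth)
    group, rewards = replay_step(noisy_policy, 0, pool, ground_truth)
    print(f"group size: {len(group)}, mean reward: {sum(rewards) / len(rewards):.2f}")
```

The structural point the sketch tries to capture is that replayed trajectories enter the same update group as the fresh rollouts, so verified solutions keep contributing positive learning signal while exploration continues.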
The researchers liken this process to a mountaineering expedition. Imagine a climber exploring multiple routes to a summit: they might find a promising route but run out of energy before reaching the very top. RLEP is like having that climber retrace the proven path back to that high point, conserving energy along the way, and then use the saved strength to push on to an even higher peak.
The reported gains are consistent across benchmarks. On the Qwen2.5-Math-7B base model, RLEP improved accuracy on the AIME-2024 math competition from 38.2% to 39.9%, on AIME-2025 from 19.8% to 22.3%, and on the AMC-2023 test from 77.0% to 82.2%. Crucially, RLEP not only reached higher final scores but also hit peak performance with substantially fewer training updates than the baseline.
By steering the model away from unproductive exploration and focusing its updates on proven reasoning paths, RLEP delivers both faster convergence and stronger, more stable final results. The researchers have made their code, datasets, and checkpoints publicly available to encourage further research in this promising area.