RL-PLUS: A New Approach to Enhance LLM Reasoning and Overcome Capability Limits
Large Language Models (LLMs) have seen significant advancements in complex reasoning tasks through Reinforcement Learning with Verifiable Rewards (RLVR). However, a critical limitation has emerged: RLVR often struggles to push LLMs beyond their inherent capabilities, leading to a “capability boundary collapse” that narrows their problem-solving scope. This paper introduces RL-PLUS, a novel hybrid-policy optimization approach designed to address this challenge by combining internal exploitation of the model’s existing knowledge with learning from external data to acquire new reasoning pathways.
The core issue, as highlighted in the paper, is that current RLVR methods tend to focus on refining existing reasoning patterns rather than discovering entirely new ones. This “inward exploitation” can lead to models that perform well on familiar tasks but fail to generalize or innovate. RL-PLUS aims to overcome this by integrating two key techniques:
- Multiple Importance Sampling (MIS): This technique addresses the challenge of integrating external data by mitigating distributional mismatch. When learning from external datasets, there’s an inherent difference between the model’s current policy and the policy that generated the data. MIS weights samples so that outcomes under the current policy can be estimated more accurately, reducing bias and variance and allowing the model to learn from diverse data sources without instability (see the first sketch after this list).
- Exploration-Based Advantage Function: This component is designed to actively encourage the discovery of novel reasoning paths. It reshapes the learning objective to reward “hard-to-explore” but correct reasoning steps more strongly. For example, if an LLM solving a complex math problem takes a less common path that ultimately leads to the correct answer, this function amplifies the reward signal for that step, guiding the model toward such under-explored pathways (see the second sketch after this list).
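To make the MIS idea concrete, here is a minimal sketch of multiple importance sampling with the balance heuristic on a toy discrete action space. The policies, reward table, and sample sizes below are illustrative assumptions, not values from the RL-PLUS paper, whose estimator operates on LLM trajectories rather than a toy problem.

```python
# Minimal sketch of multiple importance sampling (MIS) with the balance
# heuristic. All distributions and rewards here are made-up toy values.
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 5 discrete actions, a reward for each, a target policy pi
# (the model being updated), and two behavior policies that generated the
# data (e.g., the model's own rollouts and an external dataset's policy).
rewards = np.array([1.0, 0.0, 0.0, 1.0, 0.0])
pi = np.array([0.40, 0.10, 0.10, 0.30, 0.10])             # target policy
behaviors = [np.array([0.50, 0.20, 0.10, 0.10, 0.10]),    # on-policy rollouts
             np.array([0.05, 0.05, 0.10, 0.70, 0.10])]    # external data policy
n_per_policy = [2000, 2000]

true_value = float(rewards @ pi)  # ground-truth E_pi[r] for comparison

# Draw samples from each behavior policy.
samples = [rng.choice(5, size=n, p=b) for n, b in zip(n_per_policy, behaviors)]

# Balance-heuristic MIS: weight each sample x by pi(x) / mixture(x), where
# mixture is the sample-size-weighted mixture of all behavior policies.
N = sum(n_per_policy)
mixture = sum((n / N) * b for n, b in zip(n_per_policy, behaviors))

mis_terms = [pi[xs] / mixture[xs] * rewards[xs] for xs in samples]
mis_estimate = np.concatenate(mis_terms).sum() / N

# Naive single-policy importance sampling on the external data alone, for
# contrast: weights pi/q can explode wherever q is small.
xs_ext = samples[1]
naive_estimate = float(np.mean(pi[xs_ext] / behaviors[1][xs_ext] * rewards[xs_ext]))

print(f"true value      : {true_value:.3f}")
print(f"MIS estimate    : {mis_estimate:.3f}")
print(f"naive IS (ext.) : {naive_estimate:.3f}")
```

Because each sample is weighted against the mixture of behavior policies rather than a single one, the weights stay bounded wherever at least one behavior policy covers a region, which is what keeps variance under control when external data is mixed in.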
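The exploration-based advantage can likewise be sketched in a few lines. The shaping function below, which scales the advantage of correct samples by a normalized negative-log-probability factor on top of a group-mean baseline, is an assumption chosen for illustration; the exact formulation in RL-PLUS may differ.

```python
# Illustrative sketch of an exploration-shaped advantage: correct reasoning
# paths that the current policy assigns low probability to ("hard to explore")
# receive an amplified advantage. The specific shaping below is an assumption,
# not the paper's exact formula.
import numpy as np

def exploration_shaped_advantage(rewards, logprobs, eps=1e-6):
    """rewards: 1 if the sampled reasoning path is correct, else 0.
    logprobs: total log-probability the current policy assigns to each path."""
    rewards = np.asarray(rewards, dtype=float)
    logprobs = np.asarray(logprobs, dtype=float)

    # Group-relative baseline: advantage = reward minus the group mean reward.
    base_adv = rewards - rewards.mean()

    # Exploration bonus: scale up *correct* paths in proportion to how
    # unlikely they were under the current policy, so rare-but-right
    # reasoning receives a stronger learning signal.
    rarity = -logprobs                       # larger for low-probability paths
    bonus = 1.0 + rarity / (rarity.mean() + eps)
    return np.where(rewards > 0, base_adv * bonus, base_adv)

# Two correct paths: a common one (high log-prob) and a rare one (low log-prob).
adv = exploration_shaped_advantage(rewards=[1, 1, 0, 0],
                                   logprobs=[-5.0, -40.0, -6.0, -7.0])
print(adv)  # the rare correct path gets the largest advantage
```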
The effectiveness of RL-PLUS is demonstrated through extensive experiments. On six math reasoning benchmarks, RL-PLUS achieved state-of-the-art performance, outperforming existing RLVR methods. Furthermore, it showed superior generalization capabilities on six out-of-distribution reasoning tasks, indicating its ability to learn more robust and transferable reasoning skills. The approach also proved effective across diverse LLM families, showing consistent and significant gains.
A key finding from the study is the analysis of “pass@k” curves, which measure the probability of a model finding at least one correct solution within k attempts. While RLVR methods often improve pass@1 (success on a single attempt), their advantage over the base model diminishes as k increases, indicating that they might not be expanding the set of solvable problems. RL-PLUS, however, consistently shows a widening gap with the base model as k increases, demonstrating its success in breaking through capability limitations and expanding the LLM’s problem-solving repertoire.
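For reference, pass@k is typically computed with the standard unbiased estimator over n samples per problem, of which c are correct; the numbers below are made up to show how a model can look weak at pass@1 yet strong at large k, the regime the paper's analysis focuses on.

```python
# Standard unbiased pass@k estimator: pass@k = 1 - C(n-c, k) / C(n, k),
# given n sampled solutions per problem of which c are correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn without replacement
    from n total samples (c of them correct) is correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: a correct one is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# A problem where 3 of 100 samples are correct: low pass@1, much higher pass@64.
for k in (1, 8, 64):
    print(f"pass@{k} = {pass_at_k(100, 3, k):.3f}")
```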
In essence, RL-PLUS offers a more effective strategy for LLM reasoning by balancing the refinement of existing knowledge with the active exploration of new, potentially more efficient, reasoning pathways, ultimately leading to more capable and versatile AI models.