New Approach "Nudges" Large Language Models to Solve More Difficult Problems
San Francisco, CA – Researchers have developed a novel method, dubbed NuRL, that effectively “nudges” large language models (LLMs) to improve their reasoning capabilities, particularly on problems they would otherwise find unsolvable. This breakthrough addresses a key limitation in current LLM training, where models struggle to learn from challenging tasks that fall outside their immediate grasp.
Traditional reinforcement learning (RL) methods for LLMs, while effective at refining existing skills, often fail to push the boundaries of a model’s knowledge. If an LLM cannot arrive at the correct answer in any of its numerous attempts, it receives no useful reward signal, and no learning occurs for that specific problem. This leaves the model’s maximum attainable performance, or “upper limit,” unchanged.
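To see why a zero success rate freezes learning, consider a GRPO-style group baseline: when every sampled answer is wrong, every reward in the group is identical, so every advantage, and therefore the gradient for that question, is zero. The minimal Python sketch below illustrates this; the group size of eight is an illustrative assumption, not a detail from the paper.

```python
# Minimal sketch, assuming a GRPO-style group baseline with 8 rollouts per
# question (the group size is an illustrative assumption).
rewards = [0.0] * 8          # all eight sampled answers are wrong

mean_r = sum(rewards) / len(rewards)
# Each advantage is the reward minus the group mean; with every reward equal,
# every advantage is zero, so this question contributes no gradient signal.
advantages = [r - mean_r for r in rewards]
print(advantages)            # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```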
NuRL tackles this by introducing “hints”—abstract cues designed to simplify difficult problems without directly giving away the answer. Imagine a student struggling with a complex math problem. Instead of being given the final answer, they might receive a hint that points them towards a relevant theorem or a key concept, guiding their own thought process. This is precisely how NuRL operates.
The method involves two main stages. In the first, an offline hint-collection stage, the LLM is given a question along with its correct answer and prompted to generate a Chain-of-Thought (CoT) explaining how to arrive at the solution. This CoT is then distilled into an abstract hint that captures the core knowledge needed to solve the problem. Crucially, these hints are self-generated: they are produced by the LLM itself, conditioned on the correct answer, thus avoiding the need for external, potentially biased, expert models.
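A rough Python sketch of this offline stage might look like the following. The two-prompt structure follows the description above, but the prompt wording and the `llm.generate(prompt) -> str` helper are illustrative assumptions rather than the paper's actual implementation.

```python
# Hypothetical sketch of the offline hint-collection stage. The prompt wording
# and the `llm.generate(prompt) -> str` helper are illustrative assumptions,
# not the paper's exact implementation.
def collect_hint(llm, question: str, gold_answer: str) -> str:
    # Step 1: ask the model for a Chain-of-Thought, conditioned on the answer.
    cot = llm.generate(
        f"Question: {question}\nCorrect answer: {gold_answer}\n"
        "Explain, step by step, how to arrive at this answer."
    )
    # Step 2: distill the CoT into an abstract, high-level hint that names the
    # key concept without revealing the answer or the concrete solution steps.
    hint = llm.generate(
        f"Question: {question}\nReasoning: {cot}\n"
        "Write a short, abstract hint that points to the core knowledge needed "
        "to solve this problem, without giving away the answer."
    )
    return hint
```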
In the second stage, “online rollout augmentation,” NuRL integrates these hints into the training process. During standard RL training, if a model fails to solve a problem across multiple attempts (indicated by a zero success rate), NuRL intervenes. It injects the pre-generated hint along with the original question and prompts the LLM to try again. This targeted assistance transforms previously unlearnable problems into opportunities for learning.
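The online stage could be sketched as follows, under the same assumptions: a generic `llm.generate` helper, a binary `reward_fn` that returns 1.0 for a correct final answer and 0.0 otherwise, and a group size of eight. NuRL's exact rollout and reward machinery will differ.

```python
# Hedged sketch of the online rollout-augmentation step, reusing the same
# illustrative `llm.generate` helper plus an assumed binary `reward_fn`
# (1.0 for a correct final answer, 0.0 otherwise) and a group size of 8.
def rollout_with_nudge(llm, question: str, hint: str, reward_fn, n: int = 8):
    # Standard RL rollouts on the plain question.
    rollouts = [llm.generate(question) for _ in range(n)]
    rewards = [reward_fn(question, r) for r in rollouts]

    # Zero success rate: the question is currently unlearnable, so inject the
    # pre-generated abstract hint and sample a fresh group of rollouts.
    if sum(rewards) == 0:
        hinted = f"Hint: {hint}\n\n{question}"
        rollouts = [llm.generate(hinted) for _ in range(n)]
        rewards = [reward_fn(question, r) for r in rollouts]

    return rollouts, rewards
```

The hinted rollouts then feed the usual RL update, so questions that previously contributed nothing to training now produce a learning signal.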
“This approach is akin to Vygotsky’s concept of the Zone of Proximal Development,” explained Justin Chih-Yao Chen, lead author of the paper. “It’s about providing just enough scaffolding for the learner to tackle a task they couldn’t manage alone, thereby expanding their learning zone.”
The results demonstrate NuRL’s effectiveness. Across six diverse benchmarks and three different LLMs, NuRL consistently outperformed standard RL methods like GRPO. Notably, while GRPO typically improved performance within the model’s existing capabilities, NuRL actively raised the model’s “upper limit” by enabling it to solve previously impossible problems. For instance, on challenging datasets like “Date Understanding” and “GPQA,” where GRPO showed little to no improvement in pass@1024, a metric that reflects that upper limit, NuRL pushed these scores significantly higher.
The researchers also found that the nature of the hint matters. Abstract, high-level cues that guide reasoning without revealing specific solution steps were most beneficial. Directly providing the answer, conversely, was found to be detrimental, leading to “reward hacking” where the model simply learned to output the answer rather than genuinely reason. Furthermore, NuRL proved most effective when hints were applied selectively, only for difficult problems, and after the model’s initial training had stabilized, suggesting an adaptive learning strategy.
In essence, NuRL provides a powerful mechanism for LLMs to transcend their current limitations, fostering deeper reasoning abilities and unlocking the potential for solving increasingly complex tasks.