The Scalpel, Not the Hammer: How Surgical Edits Unlock LLM Reasoning
For years, researchers have known that Reinforcement Learning with Verifiable Rewards (RLVR) can dramatically boost the reasoning capabilities of Large Language Models (LLMs). By rewarding models for reaching the correct answer on math or coding tasks, techniques like Group Relative Policy Optimization (GRPO) have turned standard language models into world-class problem solvers. However, a fundamental mystery remained: what is actually happening “under the hood” at the level of individual words, or tokens, during this training?
A new study by the Qwen Pilot Team at Alibaba Group, titled “Sparse but Critical,” provides a surprising answer. RLVR does not perform a total overhaul of how a model thinks. Instead, it acts like a high-precision scalpel, making sparse, surgical edits to a tiny fraction of the model’s decisions.
The Power of the 4%
The researchers analyzed how models like Qwen2.5 changed after RLVR training. They found that for the vast majority of a generated response—often between 83% and 98% of the time—the model’s internal “probability map” for the next word remains almost identical to the base version.
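One natural way to make “almost identical probability maps” concrete is to compare the base and RL models’ next-token distributions position by position, for example with KL divergence. The sketch below uses toy hand-written distributions (not real model outputs) purely to illustrate the pattern the paper describes: near-zero divergence at most positions, with a few sharp spikes.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) between two next-token distributions
    given as aligned probability lists over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy distributions over a 4-token vocabulary at two positions.
# Position A: base and RL models agree almost exactly (the common case).
base_a = [0.70, 0.20, 0.08, 0.02]
rl_a   = [0.69, 0.21, 0.08, 0.02]

# Position B: a rare branching point where RLVR reordered the top choices.
base_b = [0.40, 0.35, 0.20, 0.05]
rl_b   = [0.15, 0.60, 0.20, 0.05]

print(kl_divergence(rl_a, base_a))  # near zero
print(kl_divergence(rl_b, base_b))  # much larger
```

Scanning a whole response this way yields a mostly-flat divergence profile with isolated spikes, which is the signature the study reports.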
To test whether these rare changes actually mattered, the team conducted “cross-sampling” experiments. They let a “dumb” base model generate a long math solution, but at a few critical junctures forced it to pick the token the “smart” RL-trained model would have chosen.
The results were startling: by replacing fewer than 4% of the tokens in a base model’s response with RL-selected ones, the model’s accuracy on difficult math benchmarks (like AIME) shot up to the level of the fully trained RL model. Conversely, when they took a smart RL model and forced it to use the base model’s word choices for just 5% of the response, its performance collapsed.
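The intervention can be sketched with toy stand-ins for the two models. Everything here is an illustrative assumption rather than the paper’s actual procedure: `base_next` and `rl_next` are hypothetical interfaces that map a token sequence to a probability dict, and the KL threshold is an arbitrary cutoff for “the models meaningfully disagree.”

```python
import math

def kl(p, q):
    """KL(P || Q) for {token: prob} dicts over roughly the same support."""
    return sum(p[t] * math.log(p[t] / q.get(t, 1e-9)) for t in p if p[t] > 0)

def cross_sample(base_next, rl_next, prompt, max_len=50, kl_threshold=0.1):
    """Decode greedily with the base model, but substitute the RL model's
    choice at high-divergence positions. Returns the sequence and the
    number of swapped tokens."""
    seq, swapped = list(prompt), 0
    for _ in range(max_len):
        p_base, p_rl = base_next(seq), rl_next(seq)
        tok_base = max(p_base, key=p_base.get)
        tok_rl = max(p_rl, key=p_rl.get)
        # Only intervene where the two models meaningfully disagree.
        if tok_rl != tok_base and kl(p_rl, p_base) > kl_threshold:
            seq.append(tok_rl)
            swapped += 1
        else:
            seq.append(tok_base)
        if seq[-1] == "<eos>":
            break
    return seq, swapped

# Toy stand-ins: the models disagree sharply at the first step
# (a branching point), then agree for the rest of the response.
def base_next(seq):
    if len(seq) == 1:
        return {"diameter": 0.40, "radius": 0.35, "chord": 0.25}
    return {"<eos>": 0.9, "more": 0.1}

def rl_next(seq):
    if len(seq) == 1:
        return {"diameter": 0.15, "radius": 0.60, "chord": 0.25}
    return {"<eos>": 0.9, "more": 0.1}

seq, swapped = cross_sample(base_next, rl_next, ["Let x be the"])
print(seq, swapped)
```

In this toy run only one token out of the whole sequence is swapped, yet it redirects the entire continuation, mirroring the paper’s finding that editing under 4% of tokens recovers most of the RL model’s accuracy.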
Shaping the Path
To build an intuition for this, imagine an LLM solving a complex geometry problem. The base model might have a 40% chance of starting with “Let $x$ be the radius” and a 35% chance of starting with “Let $x$ be the diameter.” Both are plausible, but one leads to a much simpler calculation.
The researchers found that RLVR doesn’t “invent” new ways of speaking or brand-new mathematical concepts. Instead, it identifies these high-stakes “branching points”—often located at the very beginning of a response—and slightly shifts the weights. It might nudge that 35% diameter choice up to 60%, effectively steering the model toward a more successful “reasoning trajectory.”
The study confirms that these critical edits are context-dependent. A common word like “the” might be a low-impact filler in one sentence but a high-divergence “steering” token in another, depending on whether it precedes a crucial variable or a specific logical step.
A Targeted Refinement
This “sparse” nature of reinforcement learning stands in stark contrast to Supervised Fine-Tuning (SFT). While SFT tends to rewrite the model’s behavior broadly across the entire response, RLVR is far more selective. It focuses its energy on “high-entropy” moments—points where the model is uncertain—and reorders the top few candidates that the model was already considering.
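“High-entropy moments” can be located mechanically: compute the Shannon entropy of the model’s next-token distribution at each position and keep the positions above a threshold. The distributions and the 0.5-nat cutoff below are made up for illustration; the point is only that filler tokens are near-deterministic while branching points are not.

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of a {token: prob} distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

# Toy per-position next-token distributions along one response.
positions = [
    {"the": 0.97, "a": 0.02, "this": 0.01},             # near-deterministic filler
    {"radius": 0.40, "diameter": 0.35, "chord": 0.25},  # uncertain branching point
    {"of": 0.95, "for": 0.05},                          # near-deterministic filler
]

# RLVR's edits concentrate where the model is genuinely uncertain.
high_entropy = [i for i, d in enumerate(positions) if entropy(d) > 0.5]
print(high_entropy)  # only the branching point qualifies
```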
These findings suggest that the future of AI training may lie in “divergence-weighted” learning. By identifying and focusing training signals specifically on these few high-impact tokens, researchers may be able to train even more powerful reasoning models with significantly less computational waste. RLVR, it seems, isn’t teaching models new languages; it’s teaching them how to choose the right path at the crossroads.
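One way to picture “divergence-weighted” learning, sketched under my own assumptions rather than any scheme proposed in the paper: scale an ordinary token-level negative log-likelihood by a per-position weight derived from base-vs-trained divergence, so nearly all of the gradient budget lands on the few high-impact tokens.

```python
def divergence_weighted_loss(logprobs, weights):
    """Hypothetical divergence-weighted objective: a token-level
    negative log-likelihood where each position is scaled by a weight
    (e.g. per-token KL between base and trained models), normalized
    by the total weight."""
    assert len(logprobs) == len(weights)
    return sum(w * (-lp) for lp, w in zip(logprobs, weights)) / sum(weights)

# Toy example: five tokens, only position 2 is a high-divergence fork.
token_logprobs = [-0.1, -0.2, -1.5, -0.1, -0.3]
kl_weights     = [0.01, 0.01, 1.00, 0.02, 0.01]

weighted = divergence_weighted_loss(token_logprobs, kl_weights)
uniform  = divergence_weighted_loss(token_logprobs, [1.0] * 5)
print(weighted, uniform)
```

With uniform weights the fork token is diluted by four easy tokens; with divergence weights it dominates the loss, which is the intuition behind spending training signal only at the crossroads.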