Reinforcement Learning Takes the 'Path Not Taken,' Rewriting LLM Parameters in Secret, Low-Curvature Regions

New Study Resolves Paradox of RL Sparsity, Revealing a Geometry-Driven Optimization Bias that Dictates How AI Models Learn Reasoning.

Reinforcement Learning with Verifiable Rewards (RLVR) has dramatically advanced the reasoning capabilities of large language models (LLMs) in complex tasks like math and coding. Yet, this high-gain training process presents a persistent paradox: unlike Supervised Fine-Tuning (SFT), RLVR achieves performance breakthroughs by apparently modifying only a tiny, sparse fraction of the model’s parameters.

A new paper from researchers at Meta AI and the University of Texas at Austin resolves this puzzle, demonstrating that this perceived sparsity is merely a “superficial artifact” of a deeper, persistent optimization bias rooted in the geometry of the pretrained model. RL is not actually updating fewer weights; it is updating them consistently in specific, hidden regions of parameter space.

The researchers formalize this mechanism using the “Three-Gate Theory,” explaining how RLVR updates are constrained, steered, and filtered.

First, Gate I (KL Anchor) enforces a conservative update, ensuring the new policy stays close to the old one—an implicit “leash” that prevents drastic changes. Second, Gate II (Model Geometry) dictates where this conservative step lands. Crucially, the pretrained model’s optimization landscape steers updates away from high-curvature, high-energy directions, favoring stable, low-curvature subspaces. The result is a consistent, non-random routing pattern likened to an “implicit compass” guiding RL along a low-curvature detour.
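For readers who want the "leash" in symbols, one common way to write a KL-anchored RL objective is sketched below. This is a generic form and not necessarily the paper's own formulation; the reward $r$, the coefficient $\beta$, the prompt distribution $\mathcal{D}$, and the policy names are illustrative assumptions.

```latex
J(\theta) \;=\; \mathbb{E}_{x \sim \mathcal{D}}
\Big[ \,
  \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \big[ r(x, y) \big]
  \;-\; \beta \, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{old}}(\cdot \mid x) \big)
\Big]
```

A larger $\beta$ shortens the leash: the new policy $\pi_\theta$ pays more for drifting away from $\pi_{\text{old}}$, so each step stays small.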

Finally, Gate III (Precision) explains the apparent sparsity. Because modern LLMs typically store weights in bfloat16, which keeps only eight bits of significand, the conservative micro-updates routed to non-preferred regions are often smaller than the smallest change the format can represent at that weight's magnitude (the unit in the last place, or ULP). Such updates are rounded away and never show up in the stored weights, which amplifies the apparent sparsity.
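The precision effect can be reproduced with a few lines of standard-library Python. The sketch below emulates bfloat16 rounding by keeping the top 16 bits of a float32 value; the helper names (`to_bfloat16`, `bf16_ulp`, `apply_update`) are hypothetical, and it assumes the update is applied directly to a bfloat16-stored weight, which may differ from how a given training stack keeps its master weights.

```python
import math
import struct

def to_bfloat16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even), returned as a float.

    Sketch only: NaN and infinity are not handled specially.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    # bfloat16 is the top 16 bits of float32. Adding this bias before
    # masking turns plain truncation into round-to-nearest-even.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def bf16_ulp(w: float) -> float:
    """Gap to the next larger bfloat16 value at the magnitude of w (w != 0)."""
    _, exponent = math.frexp(abs(w))  # abs(w) = m * 2**exponent, 0.5 <= m < 1
    return 2.0 ** (exponent - 8)      # 8 significand bits: 7 stored + 1 implicit

def apply_update(w: float, delta: float) -> float:
    """Hypothetical helper: add delta to a weight that is stored in bfloat16."""
    return to_bfloat16(to_bfloat16(w) + delta)

w = 0.5
small = apply_update(w, 1e-3)  # well under half a ULP: rounded away
large = apply_update(w, 3e-3)  # over half a ULP: moves to the next value
print(bf16_ulp(w), small == w, large == w)
```

At a weight of 0.5 the ULP is 2^-8 (about 0.0039), so the 0.001 update leaves the stored value untouched while the 0.003 update moves it to 0.50390625. A diff of checkpoints would count only the second weight as "changed".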

Avoiding the Principal Weights

The key distinction lies in the optimization path compared to SFT. The study found that SFT training targets the “principal weights”—the high-magnitude parameters associated with high-curvature directions, which are essential for the model’s core functional pathways.

RLVR, by contrast, consistently avoids these principal weights. Instead, it makes off-principal updates that preserve the model’s foundational spectral structure, the internal organization of its knowledge. The RLVR process keeps both “spectral drift” and “subspace rotation” minimal, so the model’s core competencies remain stable while it makes surgical tweaks in safer, low-energy regions to align with rewards. When the researchers deliberately scrambled the pretrained model’s geometry using orthogonal rotations, the characteristic RL optimization bias vanished, confirming that the pretrained geometry is what steers the updates.
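To make "spectral drift" and "subspace rotation" concrete, here is a minimal NumPy sketch of how such quantities could be measured on a single weight matrix. The exact metrics used in the paper are not given here, so both definitions below (relative change in singular values, and the largest principal angle between top-k left singular subspaces) are assumptions chosen for illustration, as are the function names and the toy matrix.

```python
import numpy as np

def spectral_drift(w_before, w_after):
    """Relative change in the singular-value spectrum (assumed definition)."""
    s0 = np.linalg.svd(w_before, compute_uv=False)
    s1 = np.linalg.svd(w_after, compute_uv=False)
    return float(np.linalg.norm(s1 - s0) / np.linalg.norm(s0))

def subspace_rotation(w_before, w_after, k):
    """Largest principal angle (radians) between top-k left singular subspaces."""
    u0 = np.linalg.svd(w_before, full_matrices=False)[0][:, :k]
    u1 = np.linalg.svd(w_after, full_matrices=False)[0][:, :k]
    # Singular values of u0^T u1 are the cosines of the principal angles.
    cosines = np.linalg.svd(u0.T @ u1, compute_uv=False)
    return float(np.arccos(np.clip(cosines, -1.0, 1.0)).max())

# Toy "pretrained" matrix with a known spectrum: W = U diag(s) V^T.
rng = np.random.default_rng(0)
u, _ = np.linalg.qr(rng.standard_normal((4, 4)))
v, _ = np.linalg.qr(rng.standard_normal((4, 4)))
s = np.array([10.0, 5.0, 1.0, 0.5])
w = (u * s) @ v.T

# Update confined to the weakest singular direction (off-principal).
off_principal = w + 0.1 * np.outer(u[:, 3], v[:, 3])
# Update that couples the weakest left direction into the top right direction.
on_principal = w + 2.0 * np.outer(u[:, 3], v[:, 0])

for name, w_new in [("off-principal", off_principal), ("on-principal", on_principal)]:
    print(name, spectral_drift(w, w_new), subspace_rotation(w, w_new, k=2))
```

In this toy setup the off-principal update leaves the top-2 subspace essentially unrotated, while the second update tilts it by roughly 0.2 radians. That is the kind of contrast the article describes between RLVR-style and SFT-style updates, although real measurements would be taken across the layers of an actual model.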

Rethinking PEFT for RL

This discovery has immediate practical implications for parameter-efficient fine-tuning (PEFT). The findings show that PEFT methods designed around SFT’s geometry—which typically prioritize updating principal directions (e.g., PiSSA, a LoRA variant)—are fundamentally misaligned with RLVR.

In experiments, restricting updates to principal weights yielded the worst optimization trajectory and degraded performance. Conversely, methods that allow updates in the low-magnitude, non-principal regions—the “safe mask” identified by the theory—closely tracked the performance of full fine-tuning. Forcing updates into SFT-favored, high-curvature directions (as seen with high learning rates in PiSSA) often destabilized training and caused early collapse, whereas standard LoRA, which naturally accommodates off-principal updates, remained robust.
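The masking experiments can be pictured with a short NumPy sketch. It assumes one plausible construction of a principal-weight mask (the largest-magnitude entries of a rank-k reconstruction of the weight matrix) and treats its complement as a stand-in for the "safe mask"; the article does not spell out the paper's exact recipe, so `principal_mask`, `masked_step`, and all parameter values are hypothetical.

```python
import numpy as np

def principal_mask(w, rank, keep_fraction):
    """Boolean mask of the largest-magnitude entries of a rank-`rank`
    reconstruction of w (an assumed definition of principal weights)."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
    n_keep = max(1, int(round(keep_fraction * w.size)))
    magnitudes = np.abs(low_rank)
    # The n_keep-th largest magnitude serves as the cut-off.
    threshold = np.partition(magnitudes.ravel(), -n_keep)[-n_keep]
    return magnitudes >= threshold

def masked_step(w, grad, mask, lr):
    """One gradient step that only touches entries where mask is True."""
    return w - lr * grad * mask

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 8))
grad = np.ones_like(w)  # placeholder gradient, not from a real loss

mask = principal_mask(w, rank=2, keep_fraction=0.25)
w_principal_only = masked_step(w, grad, mask, lr=0.01)   # SFT-favored region
w_safe_only = masked_step(w, grad, ~mask, lr=0.01)       # complement region

print(int(mask.sum()), "principal entries out of", w.size)
```

Training with `mask` corresponds to the "principal weights only" setting that the article reports as the worst trajectory, while `~mask` corresponds to leaving those weights frozen and updating everything else.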

The results offer the first parameter-level, “white-box” account of RLVR dynamics, moving beyond performance metrics to explain how models adapt and setting the stage for developing new, geometry-aware PEFT algorithms native to the reinforcement learning regime.

