AI Papers Reader

Personalized digests of the latest AI research


Soft Adaptive Policy Optimization Stabilizes RL Fine-Tuning of Large Language Models

A team from Qwen (Alibaba Inc.) has introduced a novel reinforcement learning (RL) optimization strategy, Soft Adaptive Policy Optimization (SAPO), designed to address the persistent challenge of instability when fine-tuning Large Language Models (LLMs). By replacing the brittle, all-or-nothing constraints used in current RL methods with a smooth, temperature-controlled “soft gate,” SAPO dramatically improves training stability and performance across complex reasoning tasks.

RL is crucial for enhancing the complex reasoning capabilities of LLMs in fields like mathematics and coding. However, existing policy optimization methods, such as Group Sequence Policy Optimization (GSPO) and Group Relative Policy Optimization (GRPO), often fail due to high variance in token-level importance ratios—a measure of how far off-policy a sampled token is.
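Concretely, the token-level importance ratio is the probability of the sampled token under the current policy divided by its probability under the behavior (old) policy, usually computed from log-probabilities. A minimal sketch (function and variable names are illustrative, not from the paper):

```python
import math

def importance_ratio(logprob_new: float, logprob_old: float) -> float:
    """Token-level importance ratio pi_new(token) / pi_old(token),
    computed from log-probabilities for numerical stability.

    Ratios near 1.0 mean the token is close to on-policy; ratios far
    from 1.0 signal off-policy drift, which inflates gradient variance.
    """
    return math.exp(logprob_new - logprob_old)

# A token whose log-probability rose from -2.0 to -1.5 under the new policy:
ratio = importance_ratio(-1.5, -2.0)  # exp(0.5), roughly 1.65
```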

The Instability of Hard Clipping

Traditional methods rely on “hard clipping,” which sets a rigid boundary for policy updates. If an update falls outside this boundary, the gradient signal is completely truncated, or “clipped.”

The researchers explain that this hard boundary is inherently fragile. For instance, in GSPO, if an LLM generates a long sequence of text (a mathematical proof, for example) where 99 tokens are excellent but a single token is slightly too far off-policy, GSPO clips the gradient for the entire sequence, discarding all the useful learning signal.

“Hard clipping makes it difficult to strike a favorable trade-off,” the authors note. It leads to unstable updates and poor sample efficiency because otherwise useful data is prematurely zeroed out.
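To make the failure mode concrete, here is a schematic of sequence-level hard clipping in the spirit of the GSPO example above. This is not the paper's exact objective; the epsilon value and the exaggerated outlier are chosen purely for illustration:

```python
import math

def hard_clip_sequence_weight(token_ratios, eps=0.2):
    """Sequence-level hard clipping, schematically: the sequence ratio
    is the length-normalized geometric mean of the token ratios, and
    if it leaves the trust region [1 - eps, 1 + eps], the gradient for
    the entire sequence is truncated to zero.
    """
    seq_ratio = math.exp(
        sum(math.log(r) for r in token_ratios) / len(token_ratios)
    )
    if seq_ratio < 1.0 - eps or seq_ratio > 1.0 + eps:
        return 0.0  # all learning signal from this sequence is discarded
    return seq_ratio

# 99 perfectly on-policy tokens plus one extreme outlier: the whole
# sequence is zeroed out, discarding the 99 useful tokens.
print(hard_clip_sequence_weight([1.0] * 99 + [1e9]))  # 0.0
print(hard_clip_sequence_weight([1.0] * 100))         # 1.0
```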

The Soft Solution: Continuous Trust Region

SAPO overcomes this limitation by implementing a smooth, adaptive soft gate. Conceptually, this replaces the binary “on/off” switch of hard clipping with a continuous dimmer.

When a token update is close to the current policy (on-policy), SAPO fully preserves the gradient, encouraging exploration. As the token deviates further, SAPO doesn’t immediately truncate the signal; instead, it smoothly and continuously attenuates (down-weights) the gradient. This creates a flexible, continuous trust region, ensuring that moderately off-policy tokens still contribute a valuable, though reduced, signal to the optimization.

Crucially, SAPO is token-adaptive yet sequence-coherent. In the scenario of the LLM generating a 100-token proof, SAPO will only down-weight the offending outlier token, allowing the learning signal from the 99 near-on-policy tokens within the same sequence to persist. This selective attenuation significantly boosts sampling efficiency and mitigates signal loss.
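The soft gate can be sketched as a smooth weight that equals 1 for on-policy tokens and decays continuously with off-policy drift. The sigmoid shape and temperature value below are illustrative assumptions, not the paper's exact formulation:

```python
import math

def soft_gate(ratio: float, tau: float = 10.0) -> float:
    """A smooth gate: ~1.0 when the token ratio is near 1 (on-policy),
    decaying continuously toward 0 as the token drifts off-policy.
    `tau` plays the role of the temperature controlling decay speed.
    Illustrative sigmoid-style shape, not the exact SAPO gate.
    """
    deviation = abs(math.log(ratio))  # distance from on-policy, in log space
    return 2.0 / (1.0 + math.exp(tau * deviation))

# On-policy tokens keep their full gradient; moderately off-policy
# tokens are down-weighted but still contribute; extreme outliers are
# attenuated toward zero instead of being hard-truncated.
for r in (1.0, 1.2, 3.0):
    print(r, soft_gate(r))
```

Because the gate is applied per token, an outlier only attenuates its own gradient; the rest of the sequence's signal survives, unlike under sequence-level hard clipping.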

Asymmetric Stability Control

A second key innovation for stability is SAPO’s use of asymmetric temperatures. LLMs operate with vast vocabularies, and “negative updates” (those that decrease the probability of undesirable tokens) are notoriously unstable. These updates tend to diffuse instability across the entire vocabulary, often leading to training collapse.

SAPO uses different decay parameters for positive and negative updates, setting a higher “temperature” for negative tokens. This acts like a stiffer spring, forcing the highly volatile negative gradients to decay more rapidly back toward stability, a design the team found critical for preventing early training collapse.
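Under an illustrative sigmoid-style gate, the asymmetry amounts to choosing a larger temperature when the advantage is negative, so negative updates decay faster with off-policy drift. The gate shape and temperature values below are made up for illustration, not taken from the paper:

```python
import math

def asymmetric_soft_weight(ratio: float, advantage: float,
                           tau_pos: float = 8.0,
                           tau_neg: float = 12.0) -> float:
    """Gated token weight with asymmetric temperatures: negative
    updates (advantage < 0) use a higher temperature, so their gate
    decays more rapidly as the token drifts off-policy. Shape and
    parameter values are illustrative, not the paper's exact design.
    """
    tau = tau_neg if advantage < 0 else tau_pos
    gate = 2.0 / (1.0 + math.exp(tau * abs(math.log(ratio))))
    return gate * ratio * advantage

# At the same off-policy distance, the negative update is damped harder:
print(abs(asymmetric_soft_weight(1.3, +1.0)))  # larger magnitude
print(abs(asymmetric_soft_weight(1.3, -1.0)))  # smaller magnitude
```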

Empirical testing on mathematical reasoning benchmarks (such as AIME and HMMT) demonstrated that SAPO maintains stability for a longer duration and achieves superior final Pass@1 accuracy compared to hard-clipping baselines. Furthermore, deploying SAPO to fine-tune the Qwen3-VL multimodal models yielded consistent performance gains across different model sizes and tasks, establishing SAPO as a reliable and effective strategy for large-scale RL fine-tuning.