New Approach Boosts LLM Reasoning by Eliminating Inefficiencies in Training
Tencent researchers unveil “Single-stream Policy Optimization” (SPO), a novel method for training Large Language Models (LLMs) that significantly improves efficiency and performance by streamlining the reinforcement learning process.
Existing methods for training LLMs with reinforcement learning (RL) often generate multiple responses for a single prompt and use the group’s average reward as a baseline for learning. While this approach has shown promise, it suffers from critical drawbacks. One major issue is “degenerate groups,” where every response in a group yields the same outcome (e.g., all incorrect): the group-relative advantages all collapse to zero, wasting the computation and data spent on that prompt. Another bottleneck arises in distributed training, where the entire group must wait for its slowest response to finish generating, hindering scalability, especially in complex, multi-turn tasks.
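A minimal sketch (not code from the paper) makes the collapse concrete: in a group-relative scheme, advantages are computed by centering rewards within the group, so a group in which every response receives the same reward produces all-zero advantages and no gradient signal for that prompt.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Group-relative advantage: center rewards on the group mean and
    scale by the group standard deviation (GRPO-style, illustrative only)."""
    rewards = np.asarray(rewards, dtype=float)
    std = rewards.std()
    if std < eps:
        # Degenerate group: identical rewards -> zero advantage everywhere,
        # so the whole group contributes nothing to the policy gradient.
        return np.zeros_like(rewards)
    return (rewards - rewards.mean()) / std

print(group_relative_advantages([0, 0, 0, 0]))  # all incorrect -> [0. 0. 0. 0.]
print(group_relative_advantages([1, 0, 1, 0]))  # mixed outcomes -> useful signal
```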
To overcome these limitations, the Tencent team has developed Single-stream Policy Optimization (SPO). SPO returns to a simpler, “single-stream” paradigm, where each training sample consists of a single prompt-response pair. This design inherently avoids the issues of degenerate groups and synchronization barriers.
Instead of per-group baselines, SPO uses a persistent, KL-adaptive value tracker: for each prompt, it maintains a continuously updated estimate of the probability that the model answers correctly, which serves as a stable, low-variance baseline for each individual sample. SPO then normalizes advantages globally across the entire batch rather than within small groups, further stabilizing the learning signal.
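The idea can be sketched roughly as below. This is not the authors’ implementation: the tracker uses a fixed-rate exponential moving average as a stand-in for the paper’s KL-adaptive update, and the prompt keying, prior, and update rate are illustrative assumptions.

```python
import numpy as np

class ValueTracker:
    """Per-prompt estimate of the probability of a correct response.
    Assumption: a fixed-rate EMA stands in for the KL-adaptive update
    described in the paper."""

    def __init__(self, prior=0.5, rate=0.1):
        self.values = {}      # prompt_id -> estimated P(correct)
        self.prior = prior    # baseline used for prompts not yet seen
        self.rate = rate      # update rate (fixed here; KL-adaptive in the paper)

    def baseline(self, prompt_id):
        return self.values.get(prompt_id, self.prior)

    def update(self, prompt_id, reward):
        v = self.baseline(prompt_id)
        self.values[prompt_id] = v + self.rate * (reward - v)

def spo_style_advantages(tracker, prompt_ids, rewards):
    """One response per prompt: advantage = reward - tracked baseline,
    then normalized globally across the whole batch."""
    adv = np.array([r - tracker.baseline(p) for p, r in zip(prompt_ids, rewards)],
                   dtype=float)
    for p, r in zip(prompt_ids, rewards):
        tracker.update(p, r)
    return (adv - adv.mean()) / (adv.std() + 1e-8)
```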
The benefits of SPO are substantial. Experiments with the Qwen3-8B LLM on five challenging math benchmarks showed that SPO consistently outperforms the established Group Relative Policy Optimization (GRPO) method. For instance, SPO achieved an average improvement of 3.4 percentage points in the “maj@32” metric, which measures the accuracy of the answer chosen by majority vote among 32 sampled responses. Notable gains included +7.3 percentage points on the BRUMO 25 benchmark and +4.4 percentage points on AIME 25. The “pass@k” metric, which estimates the probability of solving a problem within k attempts, also showed consistent gains for SPO across a range of k values.
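For reference, the two metrics follow their standard definitions (the snippet below is the conventional formulation, not code from the paper): maj@k checks whether the majority-voted answer among k samples is correct, and pass@k is the usual unbiased estimate of solving a problem within k attempts, given c correct answers out of n samples.

```python
from collections import Counter
from math import comb

def maj_at_k(sampled_answers, reference_answer):
    """maj@k: is the most frequent answer among the k samples correct?"""
    majority_answer, _ = Counter(sampled_answers).most_common(1)[0]
    return majority_answer == reference_answer

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples,
    drawn from n total with c correct, solves the problem."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```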
Beyond accuracy, SPO offers significant scalability advantages. In simulated agentic training scenarios with variable interaction times, SPO achieved 4.35x higher training throughput than GRPO. The gain comes from its group-free design, which removes the synchronization bottleneck and allows more flexible batching strategies. This makes SPO particularly well-suited for complex, long-horizon tasks that involve intricate tool use or multiple interaction turns.
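A toy illustration of the synchronization effect (the latency numbers below are made up, not measurements from the paper): in a grouped rollout, the update for a prompt waits on the slowest of its responses, whereas independent single streams can join the next batch as soon as each one finishes.

```python
import random

random.seed(0)
group_size = 8
# Hypothetical generation times (seconds) for the 8 responses to one prompt.
latencies = [random.lognormvariate(1.0, 1.0) for _ in range(group_size)]

# Grouped rollout: the whole group is blocked until its slowest member finishes.
grouped_wait = max(latencies)

# Group-free rollout: each response is an independent sample, so the
# expected wait before it can be batched is just its own (mean) latency.
single_stream_wait = sum(latencies) / group_size

print(f"slowest-of-{group_size} wait: {grouped_wait:.1f}s")
print(f"mean single-stream wait: {single_stream_wait:.1f}s")
```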
The researchers emphasize that SPO’s success lies in returning to fundamental RL principles rather than adding incidental complexity to existing algorithms. By offering a more stable, efficient, and scalable approach, SPO provides a strong foundation for future advancements in LLM reasoning and agentic training.