UloRL: A New Approach for Enhancing Large Language Models' Reasoning Abilities with Ultra-Long Outputs
Large Language Models (LLMs) have shown remarkable progress in reasoning tasks, a feat largely attributed to advances in reinforcement learning with verifiable rewards (RLVR). However, traditional RL frameworks struggle with ultra-long output sequences, a common characteristic of complex reasoning. The inefficiency stems from "long-tail" samples: a few extremely long sequences that force the rest of the rollout batch to wait and significantly slow down the entire training process. To overcome these challenges, researchers on Tencent's Hunyuan team have introduced UloRL, an "Ultra-Long Output Reinforcement Learning" approach.
UloRL tackles the long-tail problem by dividing ultra-long output generation into smaller, manageable segments. Under this "segment rollout" strategy, completed segments are used for training immediately, while unfinished segments continue generating in subsequent rollout steps, so fast samples no longer wait on the slowest ones. In experiments on the Qwen3-30B-A3B model, splitting generation into four segments instead of one sped up training by a factor of 2.06.
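To make the idea concrete, here is a minimal sketch of how a segment rollout loop might be organized. The `Rollout` container, `policy.generate`, and the constants `SEGMENT_LEN` and `MAX_SEGMENTS` are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field

SEGMENT_LEN = 4096   # tokens generated per rollout step (assumed value)
MAX_SEGMENTS = 4     # e.g. four segments instead of one monolithic rollout

@dataclass
class Rollout:
    prompt: str
    tokens: list = field(default_factory=list)  # tokens generated so far
    segments_done: int = 0
    finished: bool = False

def generate_segment(rollout, policy, max_new_tokens):
    """Extend one sample by up to `max_new_tokens` tokens, stopping at EOS.
    `policy.generate` is a stand-in for whatever inference engine is used."""
    new_tokens, hit_eos = policy.generate(
        rollout.prompt, rollout.tokens, max_new_tokens=max_new_tokens
    )
    rollout.tokens.extend(new_tokens)
    rollout.segments_done += 1
    rollout.finished = hit_eos or rollout.segments_done >= MAX_SEGMENTS
    return rollout

def segment_rollout_step(pending, policy):
    """One rollout step: extend every pending sample by one segment.
    Finished samples are returned for immediate training; unfinished ones
    are carried over to the next step instead of stalling the batch."""
    ready_for_training, still_pending = [], []
    for rollout in pending:
        rollout = generate_segment(rollout, policy, SEGMENT_LEN)
        (ready_for_training if rollout.finished else still_pending).append(rollout)
    return ready_for_training, still_pending
```

The key design point is that the training queue is fed by whichever samples happen to finish early, so throughput is no longer bounded by the single longest sequence in the batch.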
Another key innovation in UloRL is its approach to preventing "entropy collapse," a phenomenon in which the LLM becomes overly confident in its predictions, losing output diversity and settling into suboptimal behavior. UloRL introduces "Dynamic Masking of well-Mastered Positive Tokens" (DMMPTs): tokens the model has already mastered (i.e., predicts with high confidence) are masked out of the training update, but only when the model's overall entropy, a measure of its uncertainty, falls below a target threshold. This adaptive strategy keeps exploration and exploitation in balance and prevents premature over-specialization. In experiments, models trained with DMMPTs maintained stable entropy levels, while baseline models showed a gradual decline.
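The sketch below illustrates one plausible way such a mask could be computed. The entropy target, the mastery probability cutoff, and the exact criterion for a "well-mastered positive token" are assumptions for illustration; the paper's definitions may differ.

```python
import torch

ENTROPY_TARGET = 0.3  # target mean policy entropy (assumed value)
MASTERY_PROB = 0.99   # a token is treated as "well-mastered" above this prob (assumed)

def dmmpt_mask(logits, token_ids, advantages):
    """Build a loss mask that drops well-mastered positive tokens,
    but only when the batch's mean entropy falls below the target.

    logits:     [batch, seq, vocab] policy logits
    token_ids:  [batch, seq] sampled token ids
    advantages: [batch, seq] per-token advantages
    returns:    [batch, seq] float mask (1 = keep in the loss, 0 = drop)
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    probs = log_probs.exp()

    # Per-token entropy and its batch mean.
    entropy = -(probs * log_probs).sum(dim=-1)  # [batch, seq]
    mean_entropy = entropy.mean()

    mask = torch.ones_like(advantages)
    if mean_entropy < ENTROPY_TARGET:
        # Probability the policy assigns to the token it actually produced.
        chosen_prob = probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
        well_mastered = (chosen_prob > MASTERY_PROB) & (advantages > 0)
        mask = mask * (~well_mastered).float()
    return mask
```

Because the mask only activates when entropy dips below the threshold, the model keeps receiving gradient signal from confident tokens while its outputs are still diverse, and is restrained only when it starts to collapse.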
The research also highlights the importance of a robust reward mechanism. UloRL employs a “generative verifier model” to accurately assess the equivalence of predicted and reference answers, which is crucial for providing reliable feedback to the RL training process. To further improve data quality, the researchers implemented several data cleaning steps, such as removing overly simple questions and ensuring clarity in reference answers.
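A generative verifier can be wired into the RL loop as a simple reward function. The prompt wording and the `verifier.generate` call below are illustrative placeholders, not the paper's actual verifier interface.

```python
VERIFIER_PROMPT = """You are a grading assistant.
Question: {question}
Reference answer: {reference}
Model answer: {prediction}
Are the two answers equivalent? Reply with "YES" or "NO"."""

def verifier_reward(question, reference, prediction, verifier):
    """Binary reward from a generative verifier model.
    `verifier.generate` stands in for whatever inference API serves the verifier."""
    prompt = VERIFIER_PROMPT.format(
        question=question, reference=reference, prediction=prediction
    )
    verdict = verifier.generate(prompt, max_new_tokens=4).strip().upper()
    return 1.0 if verdict.startswith("YES") else 0.0
```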
The results are compelling. When applied to the Qwen3-30B-A3B model, UloRL achieved a significant leap in performance, improving accuracy on the AIME2025 reasoning benchmark from 70.9% to 85.1% and on the BeyondAIME benchmark from 50.7% to 61.9%. Notably, the UloRL-trained model even surpassed the performance of a larger model, Qwen3-235B-A22B. The study also confirmed a positive correlation between output length and reasoning capability, with performance gains becoming substantial when extending output lengths to 64k tokens and beyond.
In essence, UloRL offers a practical and effective solution for training LLMs on complex, long-form reasoning tasks by optimizing the RL training process for extended sequences and stabilizing model learning through intelligent entropy management. The researchers plan to release their code and models, making these advancements accessible to the broader AI community.