TreePO: A Smarter Way to Train Large Language Models for Complex Reasoning
Large language models (LLMs) have shown remarkable progress in tackling complex reasoning tasks. However, traditional reinforcement learning (RL) methods for training these models are often computationally expensive and inefficient in exploring diverse reasoning paths. A new framework called TreePO, developed by researchers at ByteDance and M-A-P, offers a solution by transforming the training process into a more intelligent, tree-structured search.
The core of TreePO lies in its tree-based rollout algorithm. Instead of generating multiple independent, sequential responses (known as rollouts) for a single prompt, TreePO views this process as a tree. Common reasoning steps, or “prefixes,” are shared across different branches of the tree, avoiding redundant computations. Imagine training an LLM to solve a math problem. Many different paths might involve the same initial steps, like identifying the given numbers and variables. TreePO ensures these common steps are computed only once, significantly speeding up the process.
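The sketch below illustrates the general idea of such a tree-structured rollout; it is a simplified illustration, not the authors' implementation. Here a TreeNode holds one segment of tokens, and expand grows children from a shared context so that prefix tokens are produced once rather than once per independent rollout. The generate_segment function is a hypothetical stand-in for the model's decoder.

```python
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    tokens: list                                  # segment of tokens generated at this node
    children: list = field(default_factory=list)  # branches that continue from this segment

def expand(node, prefix, generate_segment, branch_factor=2, max_depth=3, depth=0):
    """Grow a rollout tree: the shared context (prompt + ancestor segments)
    is assembled once per node, and every child branch reuses it instead of
    regenerating the same prefix from scratch."""
    if depth >= max_depth:
        return
    context = prefix + node.tokens               # shared prefix, computed once
    for _ in range(branch_factor):
        segment = generate_segment(context)      # only the new segment is sampled
        child = TreeNode(tokens=segment)
        node.children.append(child)
        expand(child, context, generate_segment, branch_factor, max_depth, depth + 1)

# Usage (illustrative): start from the prompt and grow a small tree of rollouts.
# root = TreeNode(tokens=[])
# expand(root, prompt_tokens, sample_fn)
```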
This tree structure also enables more efficient exploration. When the model encounters a point where it needs to make a decision (a branching point in the tree), TreePO can intelligently explore different options. It uses a dynamic sampling policy that leverages “local uncertainty” to decide which branches are more promising to explore. This is akin to a chess player exploring promising moves while pruning less advantageous ones early on.
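One simplified way to picture such an uncertainty-guided policy (a sketch under assumed thresholds, not TreePO's actual sampling rule) is to measure the model's average per-token entropy over a segment and allocate more child branches where that local uncertainty is high, while keeping a single branch where the model is already confident.

```python
import math

def segment_entropy(token_probs):
    """Mean per-token entropy (in nats) over a segment's next-token
    distributions; higher values mean the model is less certain."""
    entropies = [-sum(p * math.log(p) for p in dist if p > 0) for dist in token_probs]
    return sum(entropies) / len(entropies)

def choose_branching(token_probs, low=0.5, high=1.5, min_branches=1, max_branches=4):
    """Hypothetical policy: branch more where local uncertainty is high,
    prune to a single continuation where it is low. The thresholds `low`
    and `high` are illustrative, not values from the paper."""
    h = segment_entropy(token_probs)
    if h <= low:
        return min_branches
    if h >= high:
        return max_branches
    frac = (h - low) / (high - low)              # linear interpolation in between
    return round(min_branches + frac * (max_branches - min_branches))
```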
Furthermore, TreePO introduces a novel segment-level advantage estimation. Instead of attributing rewards to individual tokens, it considers entire segments of reasoning as a cohesive unit. This provides a more robust way to understand which parts of the reasoning process led to success or failure, allowing for more precise adjustments during training. For instance, if a model correctly sets up an equation but makes a mistake in solving it, TreePO’s segment-level estimation can more accurately pinpoint the error.
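Building on the TreeNode structure sketched above, the snippet below shows one simplified reading of segment-level credit assignment: each node's value is the mean outcome reward of the completed rollouts that pass through it, and its advantage is measured against its parent's value, so an entire segment is credited or penalized as a unit. The rewards mapping and the choice of the parent value as a baseline are assumptions for illustration, not the paper's exact estimator.

```python
def collect_leaves(node):
    """Return all completed rollouts (leaves) under `node`."""
    if not node.children:
        return [node]
    leaves = []
    for child in node.children:
        leaves.extend(collect_leaves(child))
    return leaves

def subtree_value(node, rewards):
    """Mean final reward over all leaves below `node`. `rewards` maps a leaf's
    id to its scalar outcome reward (e.g. 1.0 for a correct final answer)."""
    leaves = collect_leaves(node)
    return sum(rewards[id(leaf)] for leaf in leaves) / len(leaves)

def segment_advantages(node, rewards, advantages=None, parent_value=None):
    """Advantage of a segment = value of its subtree minus its parent's value,
    assigning credit to whole segments rather than individual tokens."""
    if advantages is None:
        advantages = {}
    value = subtree_value(node, rewards)
    baseline = parent_value if parent_value is not None else value
    advantages[id(node)] = value - baseline
    for child in node.children:
        segment_advantages(child, rewards, advantages, value)
    return advantages
```

In this picture, a segment that sets up the equation correctly inherits positive credit from the successful rollouts below it, while the segment containing the faulty algebra sees its subtree value drop relative to its parent and receives a negative advantage.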
The researchers demonstrated TreePO’s effectiveness on various reasoning benchmarks. Their experiments showed that TreePO can significantly reduce computational costs, cutting GPU hours by 22% to 43% compared to existing methods. This efficiency gain comes without sacrificing the model’s performance, which often even improves. Specifically, they observed up to a 40% reduction in compute at the trajectory level and a 35% reduction at the token level.
In essence, TreePO offers a practical pathway to scale RL-based training for LLMs, making it more efficient and effective. By intelligently structuring the exploration and credit assignment processes, TreePO paves the way for LLMs to tackle increasingly complex reasoning challenges with fewer resources.