Training Reward Models More Efficiently: Implicit Process Reward Models

A new paper from researchers at Tsinghua University and the University of Illinois Urbana-Champaign proposes a novel approach to training process reward models (PRMs) for large language models (LLMs). This method significantly reduces the computational cost and data requirements compared to existing techniques, paving the way for more accessible and efficient reinforcement learning in LLMs.

The Problem with Process Reward Models

Large language models are often trained using reinforcement learning methods. These methods require a reward model to guide the learning process and steer the LLM towards better outputs. Outcome reward models (ORMs) evaluate the entire response at once, assigning a single reward score. However, ORMs suffer from the sparsity of outcome rewards, often leading to suboptimal performance. Process reward models (PRMs), in contrast, evaluate the reasoning trajectory step-by-step, providing denser and more fine-grained feedback. This theoretically leads to better performance.
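
To make the contrast concrete, here is a toy illustration of the two feedback shapes: an ORM yields one scalar per response, whereas a PRM yields one score per reasoning step. The steps and scores below are invented purely for illustration.

    # Toy illustration of the two feedback granularities described above.
    # The reasoning steps and reward values are invented for illustration only.
    steps = ["Let x be the unknown number.", "Then 2x = 10.", "So x = 5."]

    orm_feedback = 0.9               # ORM: one scalar for the whole response
    prm_feedback = [0.8, 0.6, 0.9]   # PRM: one score per intermediate step

    assert len(prm_feedback) == len(steps)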

The challenge with PRMs has been the high cost of training them. They require labeled data at every intermediate step of the reasoning process, making data acquisition a significant bottleneck. Existing methods for automatic annotation are computationally expensive. For instance, one approach relies on Monte Carlo Tree Search (MCTS), which requires 38 times more computational resources than training an ORM.

The Implicit PRM Approach

The researchers propose a solution they call “implicit PRMs”. Their core idea is elegant: instead of explicitly training a PRM on step-by-step labels, they show that an implicit PRM can be obtained by simply training an ORM on response-level labels alone. The only assumption is that the outcome reward is parameterized as the log-likelihood ratio between the policy model and a reference model:

rθ(y) = β log [πθ(y) / πref(y)]

where:

  • rθ(y) is the outcome reward.
  • β is a hyperparameter.
  • πθ(y) is the probability that the policy model (the LLM being trained) assigns to generating response y.
  • πref(y) is the probability that a reference model (a pre-trained LLM) assigns to generating response y.
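
As a minimal sketch of how this outcome reward could be computed in practice, the snippet below sums per-token log-probabilities from two causal language models loaded with Hugging Face transformers. The model names, the β value, and the prompt/response strings are placeholders rather than values from the paper, and the prompt is assumed to tokenize as a prefix of the full text.

    # Sketch: outcome reward as a scaled policy/reference log-likelihood ratio.
    # "policy-model" and "reference-model" are placeholder checkpoint names.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def sequence_logprob(model, tokenizer, prompt, response):
        """Sum of token log-probabilities of `response` conditioned on `prompt`."""
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(full_ids).logits                  # (1, seq_len, vocab)
        log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
        targets = full_ids[:, 1:]
        token_logps = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
        # Keep only positions belonging to the response (assumes the prompt
        # tokenizes identically as a prefix of prompt + response).
        return token_logps[:, prompt_ids.shape[1] - 1:].sum().item()

    beta = 0.05  # placeholder hyperparameter value
    tok = AutoTokenizer.from_pretrained("policy-model")       # shared tokenizer assumed
    policy = AutoModelForCausalLM.from_pretrained("policy-model")
    reference = AutoModelForCausalLM.from_pretrained("reference-model")

    prompt, response = "Solve: 2 + 2 = ?", " The answer is 4."
    reward = beta * (sequence_logprob(policy, tok, prompt, response)
                     - sequence_logprob(reference, tok, prompt, response))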

The magic lies in the fact that this reward parameterization naturally induces a PRM. The log-likelihood ratio can be calculated for partial responses as well, so the reward for an individual reasoning step is simply the change in β log [πθ(y) / πref(y)] when that step is appended to the partial response. This yields step-wise rewards without any explicit step labels and drastically simplifies training, eliminating the need for expensive step-by-step annotations.
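
Continuing the idea, a plausible way to read off step-wise rewards is to take the increment of the cumulative (policy minus reference) log-ratio over each reasoning step. The per-token log-probabilities, step boundaries, and β below are illustrative inputs, not values from the paper, and how steps are delimited in practice is left unspecified here.

    # Sketch: step-wise rewards as increments of the cumulative log-ratio.
    # Inputs are illustrative; real values would come from the two LLMs.
    def process_rewards(policy_logps, ref_logps, step_ends, beta=0.05):
        """policy_logps / ref_logps: per-token log-probs of the response.
        step_ends: exclusive end index of each reasoning step."""
        log_ratio = [p - r for p, r in zip(policy_logps, ref_logps)]
        rewards, prev_end = [], 0
        for end in step_ends:
            rewards.append(beta * sum(log_ratio[prev_end:end]))
            prev_end = end
        return rewards

    # Toy usage: a 6-token response split into two steps of 3 tokens each.
    policy_logps = [-0.2, -0.5, -0.1, -0.4, -0.3, -0.2]
    ref_logps = [-0.6, -0.7, -0.3, -0.5, -0.9, -0.4]
    print(process_rewards(policy_logps, ref_logps, step_ends=[3, 6]))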

Experimental Results

The researchers evaluated their implicit PRMs on the MATH dataset, a challenging benchmark requiring complex reasoning. They instantiated the implicit PRM with different loss functions (DPO, KTO, NCA, and CE) and compared it against various baselines, including state-of-the-art methods such as Math-Shepherd and AutoPSV. The implicit PRMs consistently outperformed the baselines, achieving significant accuracy gains while requiring substantially less training data and computation (in their experiments, 38 times cheaper than Math-Shepherd).
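
As a rough sketch of the cross-entropy (CE) instantiation, one plausible reading is that the implicit reward of a full response is pushed toward a binary response-level correctness label with a logistic loss. The label convention, β value, and scalar inputs below are assumptions for illustration, not the authors' exact implementation.

    # Sketch: CE-style training signal on response-level labels only.
    # The implicit reward (beta * log-ratio of the whole response) is treated
    # as a logit for "this response is correct"; details are assumptions.
    import torch
    import torch.nn.functional as F

    def ce_orm_loss(policy_logp, ref_logp, label, beta=0.05):
        """policy_logp / ref_logp: summed log-probs of one response (tensors).
        label: 1.0 if the response is correct, 0.0 otherwise."""
        implicit_reward = beta * (policy_logp - ref_logp)
        return F.binary_cross_entropy_with_logits(implicit_reward, label)

    # Toy usage with scalar stand-ins for model log-probabilities.
    policy_logp = torch.tensor(-12.3, requires_grad=True)  # from the policy model
    ref_logp = torch.tensor(-15.0)                         # from the reference model
    loss = ce_orm_loss(policy_logp, ref_logp, label=torch.tensor(1.0))
    loss.backward()  # in real training, gradients flow back into the policy model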

Further analysis revealed that:

  • Scaling up the number of responses per instruction significantly improved the performance of the implicit PRMs.
  • The choice of loss function matters. Cross-entropy (CE) loss performed particularly well, even with extremely limited data (only one response per instruction).
  • Adding step labels from external methods (like Math-Shepherd) did not provide any additional benefit to the implicit PRM.
  • In many cases, the reference model could be omitted entirely without significantly impacting performance (a short variant of the earlier sketch is shown after this list).
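
As a minimal illustration of the last point, dropping the reference model reduces the earlier step-wise sketch to β times the policy log-likelihood of each step; the inputs are the same invented values as before, and this is a simplification rather than the authors' exact formulation.

    # Sketch: reference-free variant of the step-wise reward computation.
    # Without the reference term the scale shifts, so scores are mainly
    # useful for ranking candidate steps or responses.
    def process_rewards_no_ref(policy_logps, step_ends, beta=0.05):
        rewards, prev_end = [], 0
        for end in step_ends:
            rewards.append(beta * sum(policy_logps[prev_end:end]))
            prev_end = end
        return rewards

    print(process_rewards_no_ref([-0.2, -0.5, -0.1, -0.4, -0.3, -0.2], step_ends=[3, 6]))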

Conclusion

The research presents a significant advancement in the training of PRMs. The proposed implicit PRM approach offers a much more efficient and scalable solution, potentially democratizing the use of PRMs in reinforcement learning for LLMs. This work opens doors to more accessible and advanced training techniques that could lead to higher-quality language models.