A New Approach to Training Reward Models for Language Models

Reward models are essential for aligning language models to follow instructions, but there has been ongoing debate over which training paradigm works better: Bradley-Terry-style training on preference pairs or regression-style training on ratings. To answer this question, a research team from NVIDIA, in collaboration with Georgia Tech, has released a new dataset, HelpSteer2-Preference, together with a paper that compares the two paradigms head to head.

HelpSteer2-Preference is a dataset of 7,118 preference pairs, each annotated with a human-written justification explaining the annotator’s choice. The key design decision is that these preference annotations were collected on the same prompts and responses as the existing HelpSteer2 dataset, which provides per-response ratings on a Likert-5 scale. Having both kinds of labels for the same data allows a direct comparison of the two training paradigms under identical conditions.
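
To make that comparison concrete, here is a minimal sketch of what the two kinds of supervision look like side by side. The field names and values are illustrative placeholders, not the dataset’s exact schema.

```python
# Illustrative record shapes for the two annotation styles; field names here
# are assumptions for this sketch, not the dataset's exact schema.
helpsteer2_rating = {                 # regression-style supervision (HelpSteer2)
    "prompt": "Explain beam search.",
    "response": "Beam search keeps the k best partial hypotheses ...",
    "helpfulness": 4,                 # Likert-5 rating
}

helpsteer2_preference = {             # Bradley-Terry-style supervision (HelpSteer2-Preference)
    "prompt": "Explain beam search.",
    "response_a": "Beam search keeps the k best partial hypotheses ...",
    "response_b": "Beam search is a sorting algorithm ...",
    "preferred": "a",                 # which response the annotator chose
    "preference_strength": 2,         # how strongly it is preferred
    "justification": "Response A is accurate; B confuses beam search with sorting.",
}
```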

The team’s findings suggest that neither paradigm is inherently better than the other; what matters is pairing each one with the right model architecture and training choices. Among the models trained with a single paradigm, a Scaled Bradley-Terry model trained from scratch performed best, reaching a score of 92.7 on the RewardBench benchmark.
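
In training terms, the two paradigms differ mainly in the loss applied to the reward model’s scalar output. Below is a minimal PyTorch sketch of the objectives; the “scaled” variant is shown here as weighting each pair by its annotated preference strength, which is an assumption about the exact formulation rather than a transcription of the paper’s equation.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Standard Bradley-Terry objective: maximize the log-probability that the
    # chosen response receives a higher scalar reward than the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def scaled_bradley_terry_loss(r_chosen, r_rejected, strength):
    # "Scaled" variant, sketched as weighting each pair by its annotated
    # preference strength; the paper's exact formulation may differ.
    return -(strength * F.logsigmoid(r_chosen - r_rejected)).mean()

def steerlm_regression_loss(pred_helpfulness, target_helpfulness):
    # Regression-style objective: fit the Likert-scale helpfulness rating directly.
    return F.mse_loss(pred_helpfulness, target_helpfulness)
```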

The strongest result, however, comes from combining the two paradigms. By initializing Scaled Bradley-Terry training from a Helpfulness-Only SteerLM Regression model, the team reached a top score of 94.1 on RewardBench. This model significantly outperformed existing reward models, demonstrating the value of combining the two approaches.
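
A toy sketch of that two-stage recipe: fit helpfulness ratings with a regression loss first, then keep those weights and continue with Scaled Bradley-Terry training on preference pairs. The model, shapes, and data below are stand-ins for illustration, not the paper’s implementation.

```python
import torch
import torch.nn.functional as F

# Toy stand-in for an LLM-based reward model (an encoder plus a scalar head).
class ToyRewardModel(torch.nn.Module):
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, dim)
        self.head = torch.nn.Linear(dim, 1)

    def forward(self, tokens):                     # tokens: (batch, seq_len)
        pooled = self.embed(tokens).mean(dim=1)    # crude pooling over tokens
        return self.head(pooled).squeeze(-1)       # one scalar reward per sequence

model = ToyRewardModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Stage 1: SteerLM-style regression on Likert-scale helpfulness ratings.
tokens = torch.randint(0, 1000, (8, 32))           # placeholder tokenized responses
helpfulness = torch.randint(0, 5, (8,)).float()    # placeholder ratings
F.mse_loss(model(tokens), helpfulness).backward()
optimizer.step()
optimizer.zero_grad()

# Stage 2: keep the regression-trained weights as the starting point and
# continue with Scaled Bradley-Terry training on preference pairs.
chosen = torch.randint(0, 1000, (8, 32))
rejected = torch.randint(0, 1000, (8, 32))
strength = torch.randint(1, 4, (8,)).float()       # annotated preference strength
loss = -(strength * F.logsigmoid(model(chosen) - model(rejected))).mean()
loss.backward()
optimizer.step()
```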

Finally, the research team demonstrated the reward model’s effectiveness for aligning language models in Reinforcement Learning from Human Feedback (RLHF) pipelines. Training against it led to significant improvements for the aligned language models across a range of metrics, including response length, helpfulness, and the ability to follow instructions.
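
For context, this is the standard place a scalar reward model occupies in an RLHF loop: generate responses, score them, and use the scores as the policy-gradient signal. The sketch below is a generic REINFORCE-style step, not the paper’s specific pipeline; the `policy`, `reward_model`, `tokenizer`, `prompts`, and `optimizer` objects are assumed to exist, and prompt/padding masking is omitted for brevity.

```python
import torch

def rlhf_step(policy, reward_model, tokenizer, prompts, optimizer):
    # 1. Roll out: sample a response from the current policy for each prompt.
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)
    sequences = policy.generate(**inputs, max_new_tokens=256, do_sample=True)

    # 2. Score: the reward model assigns one scalar per (prompt, response).
    with torch.no_grad():
        rewards = reward_model(sequences)                      # shape: (batch,)

    # 3. Update: a simple REINFORCE-style objective that scales token
    #    log-probabilities by a baseline-subtracted reward (the advantage).
    logits = policy(sequences).logits[:, :-1, :]
    logprobs = torch.log_softmax(logits, dim=-1)
    token_logprobs = logprobs.gather(-1, sequences[:, 1:].unsqueeze(-1)).squeeze(-1)
    advantage = rewards - rewards.mean()
    loss = -(advantage.unsqueeze(-1) * token_logprobs).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return rewards.mean().item()
```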

This work is a significant contribution to the field of reward modeling. The new dataset and the recipe for combining the two most popular training paradigms give researchers and practitioners a valuable resource for aligning language models, and the release of the HelpSteer2-Preference dataset along with the reward-model code will allow others to build on this work and further advance the field of language model alignment.