Agentic Reward Modeling Improves Language Model Reliability with Correctness Signals

🔊

💬 Ask

Large language models (LLMs) are often trained using reward models that prioritize human preferences. However, these models can be susceptible to subjective biases and may overlook factual errors or failures to adhere to instructions. Researchers at Tsinghua University have proposed a new approach called “agentic reward modeling” to address these limitations. This method combines reward models with verifiable correctness signals from different aspects, resulting in more reliable rewards and improved LLM performance.

The core idea is to create a reward system that not only considers human preferences but also incorporates automated verification agents that assess the correctness of the LLM’s responses. Think of it like having a team of editors: one focuses on whether the response sounds good (human preference), while others meticulously check for accuracy and adherence to specific guidelines.

The team built REWARDAGENT, a reward agent that combines human preference rewards with two key verifiable signals:

Factuality: This assesses the factual correctness of claims made in the response. Instead of blindly accepting everything the LLM says, a verification agent checks the facts. For example, if an LLM is asked to write a bio of Qomolangma (Mount Everest), the factuality agent would verify the stated height against reliable sources.
Instruction Following: This evaluates whether the response adheres to hard constraints specified in the instruction, such as length limitations. In the Qomolangma example, if the instruction specifies a 100-word limit, the instruction-following agent ensures that the response stays within that limit, even if the language is excellent.

The REWARDAGENT architecture consists of three main modules:

Router: Analyzes the instruction and determines which verification agents to invoke. If the prompt doesn’t require any claims, no verification of the factuality would be required.
Verification Agents: Evaluate the correctness of the response based on factuality and instruction-following.
Judger: Integrates the correctness signals from the verification agents and human preference scores from the reward models to provide an overall reward score.

The researchers conducted experiments using existing reward model benchmarks and real-world downstream tasks. The results showed that REWARDAGENT significantly outperformed vanilla reward models, demonstrating its effectiveness. In particular, the researchers use inference time best-of-n searches on TriviaQA, IFEval and CELLO, for both GPT-40 and LLama3-8B which confirms that REWARDAGENT is able to pick superior responses and scale inference time laws. Also, they construct training preference pairs using REWARDAGENT and train an LLM with the DPO objective achieving superior performance on various NLP benchmarks compared to conventional reward models

“By integrating verifiable correctness rewards with human preferences, the reward system selects the superior response”, says Hao Peng, the lead author of the study. “Agentic reward modeling enhances reliability through multi-dimensional correctness signals, enables flexible integration of diverse verification agents, and improves the interpretability of the final reward.”

This work highlights the importance of incorporating verifiable correctness signals into reward models for LLMs and provides a promising direction for future research in this area.

AI Papers Reader

Personalized digests of latest AI research

Agentic Reward Modeling Improves Language Model Reliability with Correctness Signals

Chat about this paper