Rethinking Reward Models: Generative Outcome Verification Outperforms Step-by-Step Scoring in Diverse Domains
A surprising finding in the evaluation of large language models (LLMs) suggests that a simpler approach to verifying their reasoning can be more effective across a wide range of tasks. A new paper on arXiv challenges the long-held assumption that detailed, step-by-step evaluation of LLM reasoning is always superior to judging the final outcome.
Traditionally, researchers have relied on “reward models” to guide LLMs towards better performance. These models act as external judges, evaluating the quality of an LLM’s output. There are two main types: those that assess the final answer (Outcome Reward Models or ORMs) and those that scrutinize each reasoning step (Process Reward Models or PRMs). The prevailing wisdom, largely based on math-related problems, has been that PRMs, by providing fine-grained feedback at each step, are more effective than ORMs.
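As a rough illustration of the split, the sketch below contrasts the two scoring styles. It is a minimal sketch only: the `Scorer` callable stands in for whatever trained reward model or LLM judge a real system would use, and none of the names come from the paper.

```python
from typing import Callable, List

# A "scorer" maps (question, text) to a scalar reward in [0, 1].
# In practice it would be a trained reward model or an LLM judge;
# here it is left abstract so the two scoring styles stand out.
Scorer = Callable[[str, str], float]

def outcome_reward(scorer: Scorer, question: str, steps: List[str]) -> float:
    """ORM-style scoring: one judgment over the completed solution."""
    return scorer(question, "\n".join(steps))

def process_reward(scorer: Scorer, question: str, steps: List[str]) -> List[float]:
    """PRM-style scoring: one judgment per step, each seeing the prefix so far."""
    return [scorer(question, "\n".join(steps[: i + 1])) for i in range(len(steps))]

if __name__ == "__main__":
    dummy = lambda q, text: min(1.0, len(text) / 100)  # toy stand-in scorer
    steps = ["Restate the problem.", "Apply the formula.", "Final answer: 42."]
    print(outcome_reward(dummy, "What is 6 * 7?", steps))   # single scalar
    print(process_reward(dummy, "What is 6 * 7?", steps))   # one scalar per step
```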
However, this new study, which comprehensively evaluated four variants of these reward models – discriminative and generative ORMs and PRMs – across 14 diverse domains, reveals a different story. The researchers found that for broader applications beyond math, generative outcome reward models (gORMs) consistently outperformed other methods, including PRMs.
Think of it like this: Imagine you’re asking an LLM to explain a complex scientific concept.
- Discriminative PRM: This is like a teacher who meticulously checks every sentence the LLM writes, flagging any minor error or logical flaw, even if the overall conclusion is correct. The paper found that this detailed, step-by-step approach, while effective on specific math problems, struggles with the longer, more nuanced reasoning chains common in broader domains. It can get bogged down by early errors, even if the LLM later corrects itself.
- Generative ORM: This is akin to an expert who reads the entire explanation and then gives a holistic judgment on whether the final answer and the overall reasoning are sound. The study found that this approach, especially when a generative LLM produces the verification itself, is surprisingly robust: it can effectively assess complex reasoning, even when the LLM makes minor mistakes along the way, as long as the final outcome is correct (a prompt sketch follows this list).
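To make the "generative" part concrete, here is a minimal sketch of how a generative outcome verifier might be prompted and parsed. The prompt wording, the `generate` callable, and the verdict format are illustrative assumptions, not the paper's exact setup.

```python
# Sketch of generative outcome verification: the verifier LLM reads the whole
# solution, writes its own critique, and ends with an explicit verdict that is
# parsed into a scalar reward. `generate` stands in for any LLM completion call.

VERIFIER_PROMPT = """You are a strict grader.

Question:
{question}

Candidate solution:
{solution}

Decide whether the final answer is correct. Explain your reasoning, then end
your reply with exactly one line: "Verdict: correct" or "Verdict: incorrect".
"""

def generative_outcome_reward(generate, question: str, solution: str) -> float:
    """Return 1.0 if the verifier judges the final answer correct, else 0.0."""
    critique = generate(VERIFIER_PROMPT.format(question=question, solution=solution))
    verdict = critique.strip().splitlines()[-1].lower()
    return 1.0 if "correct" in verdict and "incorrect" not in verdict else 0.0

if __name__ == "__main__":
    fake_llm = lambda prompt: "The arithmetic checks out.\nVerdict: correct"
    print(generative_outcome_reward(fake_llm, "What is 6 * 7?", "6 * 7 = 42"))  # 1.0
```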
The researchers attribute the success of gORMs to their ability to provide a more holistic evaluation, which is less susceptible to accumulating errors in long reasoning chains or being misled by noisy labels that can plague LLM-generated training data for PRMs. In contrast, PRMs, by focusing on individual steps, can be disproportionately affected by the length of the reasoning process and the imperfections of the training data.
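One way to see the length sensitivity: if step scores are aggregated multiplicatively (a common choice, assumed here purely for illustration rather than taken from the paper), even a consistently high per-step score decays quickly as the chain grows.

```python
# Why step-level scores can penalize long chains: under product aggregation
# (assumed here for illustration), a solution whose every step scores 0.98
# still loses most of its reward once the chain gets long.
per_step = 0.98
for n_steps in (5, 20, 50, 100):
    print(n_steps, round(per_step ** n_steps, 3))
# 5 -> 0.904, 20 -> 0.668, 50 -> 0.364, 100 -> 0.133
```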
This research challenges the convention that more granular supervision is always better and suggests that for diverse, real-world LLM applications, a generative approach that focuses on the overall outcome might be a more reliable and effective strategy for ensuring accurate and trustworthy AI responses. The team has made their code, datasets, and checkpoints publicly available to foster further research in this area.