AI Papers Reader

Personalized digests of the latest AI research


Language Models Can Now Learn to Judge Themselves, Leading to Improved Instruction Following

Large language models (LLMs) are rapidly advancing, but training them typically relies on costly human-generated data. Recent work has shown that LLMs can improve themselves by judging their own responses, but existing methods focus almost entirely on improving the responses while leaving the model’s judging ability unchanged, so the self-generated reward signal quickly saturates during iterative training.

A new paper, “Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge,” introduces a method called Meta-Rewarding that addresses this limitation. The authors propose a three-role system in which the same LLM acts as an actor, generating responses; as a judge, evaluating those responses; and as a meta-judge, evaluating the judge’s own judgments.
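The division of labor can be pictured with a short sketch. The code below is only a simplified illustration of the actor/judge/meta-judge roles, assuming a generic `llm(prompt)` text-generation helper (a hypothetical stand-in, not the authors’ implementation); the prompts are paraphrases, not the paper’s templates.

```python
# Simplified sketch of the three Meta-Rewarding roles. `llm` is a hypothetical
# text-generation callable; prompts are paraphrased, not the paper's templates.
from typing import Callable, List


def act(llm: Callable[[str], str], instruction: str, n: int = 4) -> List[str]:
    """Actor: sample several candidate responses to one instruction."""
    return [llm(f"Respond to the following instruction:\n{instruction}") for _ in range(n)]


def judge(llm: Callable[[str], str], instruction: str, response: str) -> str:
    """Judge: write an evaluation of a response that ends with a numeric score."""
    return llm(
        "Evaluate the response below and end with a score out of 5.\n"
        f"Instruction: {instruction}\nResponse: {response}"
    )


def meta_judge(llm: Callable[[str], str], instruction: str, response: str,
               judgment_a: str, judgment_b: str) -> str:
    """Meta-judge: decide which of two judgments of the same response is better."""
    verdict = llm(
        "Which judgment, A or B, evaluates the response more accurately and fairly?\n"
        f"Instruction: {instruction}\nResponse: {response}\n"
        f"Judgment A: {judgment_a}\nJudgment B: {judgment_b}\n"
        "Answer with A or B:"
    )
    return "A" if "A" in verdict.upper() else "B"
```

Preferences over responses (from the judge) and preferences over judgments (from the meta-judge) can then both be used to update the same model, so the judging skill improves alongside the responding skill.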

The meta-judge plays a key role in improving the LLM’s judgment skills. It helps the LLM learn to recognize good judgments and to avoid biases such as favoring longer responses over shorter ones.
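As a simplified illustration of guarding against length bias when turning judge scores into preference pairs (not the paper’s exact length-control mechanism), ties at the top score can be broken in favor of the shorter response:

```python
# Simplified illustration of length-aware preference-pair selection. This is
# not the paper's exact length-control mechanism; it only shows the idea of
# not rewarding verbosity when judge scores are tied.

def pick_preference_pair(responses, scores):
    """Pick (chosen, rejected) from scored responses.

    Among responses sharing the top score, prefer the shortest one so that the
    training signal does not drift toward ever-longer answers.
    """
    best, worst = max(scores), min(scores)
    if best == worst:
        return None  # no usable preference signal for this instruction
    chosen = min((r for r, s in zip(responses, scores) if s == best), key=len)
    rejected = next(r for r, s in zip(responses, scores) if s == worst)
    return chosen, rejected


# Example: the two top-scored responses tie, so the shorter one is chosen.
pair = pick_preference_pair(
    ["A short, correct answer.", "A much longer answer " * 10, "Wrong."],
    [5, 5, 2],
)
```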

The authors demonstrate the effectiveness of Meta-Rewarding by training Llama-3-8B-Instruct in an iterative self-play loop, where each iteration’s preference data is generated by the model itself acting in the three roles described above. They evaluate the resulting models on AlpacaEval 2, Arena-Hard, and MT-Bench, popular benchmarks for instruction following, challenging question answering, and multi-turn conversation.
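Continuing the sketch above, one iteration of such a self-play loop might look roughly like the following. `train_preference_model` is a hypothetical training routine standing in for a preference-optimization update, and `parse_score` is a naive score extractor; neither is taken from the paper.

```python
import re


def parse_score(judgment: str) -> float:
    """Naively take the last number mentioned in a judgment as its score."""
    nums = re.findall(r"\d+(?:\.\d+)?", judgment)
    return float(nums[-1]) if nums else 0.0


def meta_rewarding_iteration(llm, train_preference_model, instructions):
    """One self-play iteration: build preference data with the model itself, then train."""
    actor_pairs, judge_pairs = [], []
    for instruction in instructions:
        responses = act(llm, instruction)
        # Judge each response a few times; the averaged scores rank the responses.
        judgments = [[judge(llm, instruction, r) for _ in range(3)] for r in responses]
        scores = [sum(parse_score(j) for j in js) / len(js) for js in judgments]
        pair = pick_preference_pair(responses, scores)
        if pair is not None:
            actor_pairs.append((instruction, *pair))
        # The meta-judge compares judgments of the same response, producing
        # preference data over judgments so that judging improves as well.
        for response, js in zip(responses, judgments):
            winner = meta_judge(llm, instruction, response, js[0], js[1])
            chosen, rejected = (js[0], js[1]) if winner == "A" else (js[1], js[0])
            judge_pairs.append((instruction, response, chosen, rejected))
    return train_preference_model(actor_pairs, judge_pairs)
```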

The results show significant improvement in the model’s performance on all three benchmarks. For example, on AlpacaEval 2, the model’s length-controlled win rate increased from 22.9% to 39.4% after four iterations of Meta-Rewarding training, surpassing even GPT-4-0314, a powerful language model.

The authors also demonstrate that Meta-Rewarding improves the model’s ability to judge responses: after training, its judgments agree more closely with human judgments than before.
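One common way to quantify such judge quality, shown here purely as an illustration rather than the paper’s exact protocol, is pairwise agreement between the judge’s scores and human preferences over pairs of responses:

```python
# Illustration only: pairwise agreement between a model judge and human raters.

def pairwise_agreement(model_scores, human_prefs):
    """Fraction of response pairs where the model's judge agrees with humans.

    model_scores: list of (score_a, score_b) given by the model's judge.
    human_prefs:  list of "a" or "b", the response humans preferred.
    """
    agree = 0
    for (score_a, score_b), pref in zip(model_scores, human_prefs):
        model_pref = "a" if score_a > score_b else "b"
        agree += model_pref == pref
    return agree / len(human_prefs)


# Example: the judge agrees with humans on 2 of 3 pairs.
print(pairwise_agreement([(5, 3), (2, 4), (4, 4)], ["a", "b", "a"]))
```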

This research suggests that Meta-Rewarding could significantly advance the development of LLMs. It opens up a promising direction for self-improvement, potentially leading to even more powerful and reliable AI systems without relying on expensive and time-consuming human feedback.