Meta Policy Optimization: LLMs Learn to Think Better by Evolving How They're Evaluated
Large language models (LLMs) are increasingly being used for complex tasks that require both reasoning and alignment with human values. A common approach is reward-based learning, where the LLM is trained to maximize a reward signal. However, this approach faces challenges: models can "hack" the reward signal, scoring highly without actually improving, and creating effective reward models often requires extensive manual prompt engineering. A new paper introduces "Meta Policy Optimization" (MPO) to address these issues.
MPO dynamically adapts the reward model during training. Imagine a student learning in a classroom: the LLM is the student, a regular reward model is like a junior instructor following a fixed grading rubric, and MPO introduces a “senior instructor.” This senior instructor (a “meta-reward model”) observes how the student is performing and how well the junior instructor’s rubric is working. If the student starts exploiting loopholes in the rubric to get high scores without doing quality work, the senior instructor steps in and refines the rubric, adapting it to the current training context to provide a more accurate and robust reward signal.
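To make the idea concrete, a minimal sketch of such a training loop is shown below. This is an illustration only, not the paper's implementation: every function name here (`policy_generate`, `score_with_rubric`, `meta_review_rubric`, `update_policy`) and the refinement schedule are assumptions made for clarity.

```python
# Conceptual sketch of a reward loop with periodic rubric refinement.
# All names are hypothetical placeholders, not the paper's actual API.

def policy_generate(prompt: str) -> str:
    """Placeholder: the LLM 'student' produces a response to the task prompt."""
    return "draft response to: " + prompt

def score_with_rubric(response: str, rubric: str) -> float:
    """Placeholder: the reward model ('junior instructor') grades the response
    against the current rubric and returns a scalar reward."""
    return 0.0

def meta_review_rubric(rubric: str, transcripts: list) -> str:
    """Placeholder: the meta-reward model ('senior instructor') inspects recent
    (response, reward) pairs and rewrites the rubric to close loopholes."""
    return rubric + "\n- (refined criterion added by meta-reward model)"

def update_policy(reward: float) -> None:
    """Placeholder: a policy-gradient update step on the LLM being trained."""
    pass

rubric = "Initial grading rubric: reward clear, well-structured essays."
task_prompts = ["Write an essay on renewable energy."] * 4
REFINE_EVERY = 2  # assumed refinement interval, chosen for illustration

transcripts = []
for step, prompt in enumerate(task_prompts, start=1):
    response = policy_generate(prompt)
    reward = score_with_rubric(response, rubric)
    update_policy(reward)
    transcripts.append((response, reward))

    # Periodically the meta-reward model reviews how the rubric is performing
    # and refines it, so the reward signal adapts to the policy's current
    # behavior instead of remaining static and exploitable.
    if step % REFINE_EVERY == 0:
        rubric = meta_review_rubric(rubric, transcripts)
        transcripts.clear()
```

The key design point is that the rubric itself becomes a trainable artifact: rather than the reward model judging against a fixed prompt, the meta-reward model periodically rewrites that prompt based on observed training behavior.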
This “meta-learning” approach, inspired by metacognition (thinking about thinking), offers several advantages. First, it leads to more stable training because the reward model adjusts to the LLM’s evolving strategies, preventing exploitation of static reward signals. Second, it reduces the need for manual prompt engineering, as the system refines prompts automatically. Third, it’s a flexible framework that can be applied to various tasks without major modifications.
The researchers tested MPO on tasks like essay writing, summarization, ethical reasoning, and mathematical problem solving. In essay writing, for example, they found that MPO-aligned LLMs outperformed those trained with fixed reward models, even those using manually optimized prompts. The automatic prompt refinement reduced reliance on manual tuning and mitigated the risk of failed training runs, saving time and resources. The initial, unrefined version of the reward prompt consistently yielded the lowest performance, reinforcing the benefit of the refinement process.
Interestingly, the best results occurred when the reward model and meta-reward model were of equal size. The researchers suggested that identically sized models develop similar token-usage and reasoning styles, making the meta-reward model's rubric refinements easier for the reward model to interpret. They also observed that the MPO-evolved rubrics developed a hierarchical structure indicative of human-like writing.
This research demonstrates that equipping LLMs with the ability to reflect on and refine their evaluation criteria leads to more robust and adaptable alignment, paving the way for more reliable and effective AI systems. The code and models will be publicly shared.