Automating Reward Modeling for Large Language Model Agents
Large language models (LLMs) perform well on many text tasks, but applying them to tasks that require multi-step decision-making and environmental feedback remains challenging. A new framework, ARMAP (Autonomous Agents from Automatic Reward Modeling and Planning), addresses this limitation by learning reward models automatically, which improves the performance of LLM-based agents.
The core problem addressed by ARMAP is the difficulty of training LLMs for agent tasks. Unlike data for pure text generation, large-scale data for agent tasks, which involve sequences of actions and environmental responses, is expensive and labor-intensive to collect. Furthermore, many powerful LLMs are accessible only through APIs, which restricts fine-tuning for specific agent tasks.
ARMAP bypasses these challenges by focusing on automatically learning a reward model instead of directly training a policy model (which maps states to actions). The intuition is that evaluating the quality of an action sequence is often easier than generating the optimal sequence. Imagine instructing an AI to buy shoes online: judging whether a sequence of clicks led to the desired result (e.g., correct size, price) is simpler than devising the entire sequence from scratch.
The ARMAP framework consists of three steps:
- Trajectory Collection: An LLM-based agent interacts with the environment (e.g., a webshop, a simulated science lab) to generate diverse action trajectories while attempting a given task, such as finding a particular product.
- Reward Data Generation: A second LLM analyzes the collected trajectories, refining the initial task instructions (if needed) and generating both positive (successful) and negative (unsuccessful) trajectory examples based on the refined instructions. This automated creation of training data avoids the need for manual labeling.
- Reward Model Training & Planning: A vision-language model (VILA) is trained on the generated triplets (task intent, positive trajectory, negative trajectory). This reward model assesses the quality of a trajectory, giving higher scores to trajectories that successfully complete the task. The model then guides the LLM agent’s actions through a planning algorithm, such as Monte Carlo Tree Search or “Best-of-N,” which selects the highest-scoring trajectory from multiple candidates.
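The training signal and the Best-of-N selection from the last step can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: it assumes a Bradley-Terry-style pairwise objective (the summary above does not state which loss is used), and `toy_reward` is a hypothetical keyword-overlap stand-in for the trained VILA reward model.

```python
import math
from typing import Callable, List, Sequence

Trajectory = List[str]  # a trajectory is modeled here as a list of action strings


def pairwise_loss(score_pos: float, score_neg: float) -> float:
    """Assumed pairwise objective: -log(sigmoid(score_pos - score_neg)).

    The loss is small when the positive trajectory outscores the negative one.
    """
    margin = score_pos - score_neg
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def toy_reward(intent: str, trajectory: Trajectory) -> float:
    """Hypothetical reward: fraction of intent words that appear in the actions."""
    wanted = set(intent.lower().split())
    seen = set(" ".join(trajectory).lower().split())
    return len(wanted & seen) / len(wanted) if wanted else 0.0


def best_of_n(
    intent: str,
    candidates: Sequence[Trajectory],
    reward_fn: Callable[[str, Trajectory], float],
) -> Trajectory:
    """Best-of-N: score every candidate trajectory and keep the highest-scoring one."""
    return max(candidates, key=lambda t: reward_fn(intent, t))


intent = "red shoes size 9"
candidates = [
    ["search shoes", "click blue shoes size 9", "buy"],
    ["search red shoes", "click red shoes size 9", "buy"],
    ["search hats", "buy"],
]
chosen = best_of_n(intent, candidates, toy_reward)
```

In a real setup the candidates would be sampled from the LLM agent and `reward_fn` would call the trained reward model; the selection logic itself stays this simple.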
The authors evaluated ARMAP on three diverse benchmarks: a webshop simulation, a simulated science environment (ScienceWorld), and a mathematical puzzle game (Game of 24). In each case, ARMAP significantly outperformed baselines that did not utilize a learned reward model. Importantly, ARMAP’s effectiveness extended to different planning algorithms and LLM agents, demonstrating its flexibility and generalizability.
ARMAP also offers a degree of control over the agent’s behavior. By customizing the reward function during inference, the researchers could prioritize certain aspects, such as minimizing the number of actions or the total cost, without needing to retrain the LLM. For instance, in the webshop scenario, they could guide the agent toward cheaper products while still achieving the main shopping goal.
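One way to picture this kind of inference-time control is to combine the learned reward with explicit penalties when ranking candidate trajectories. The sketch below is an illustration under stated assumptions: the summary does not say how preferences are combined with the reward model's score, so the linear penalty, the `Candidate` fields, and the weight values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Candidate:
    name: str          # label for the trajectory (illustrative only)
    reward: float      # score from the learned reward model (made-up values here)
    num_actions: int   # length of the trajectory
    price: float       # price of the product the trajectory ends on


def customized_score(
    c: Candidate, action_weight: float = 0.0, price_weight: float = 0.0
) -> float:
    """Assumed scoring rule: learned reward minus linear penalties."""
    return c.reward - action_weight * c.num_actions - price_weight * c.price


def pick(
    candidates: Sequence[Candidate],
    action_weight: float = 0.0,
    price_weight: float = 0.0,
) -> Candidate:
    """Select the candidate with the highest customized score."""
    return max(
        candidates,
        key=lambda c: customized_score(c, action_weight, price_weight),
    )


candidates = [
    Candidate("premium", reward=0.90, num_actions=6, price=80.0),
    Candidate("budget", reward=0.85, num_actions=4, price=40.0),
]

default_choice = pick(candidates)                      # selects "premium"
cheaper_choice = pick(candidates, price_weight=0.005)  # selects "budget"
shorter_choice = pick(candidates, action_weight=0.05)  # selects "budget"
```

Only the scoring rule changes between the three calls; neither the agent nor the reward model is retrained, which matches the kind of control described above.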
This research represents a significant advance in applying LLMs to complex tasks. By automating the learning of reward models, ARMAP works around data scarcity and API limitations, paving the way for more sophisticated AI agents capable of tackling a wider range of real-world problems. Because the reward data is generated automatically, the approach avoids costly and time-consuming human annotation, making such agents easier to deploy in the real world.