
QLASS: Guiding Language Agents with Stepwise Q-Value Feedback

Large language models (LLMs) are increasingly used to build language agents capable of complex interactions. These agents act within an environment, making decisions based on observations and aiming to maximize a final reward. However, training these agents efficiently is challenging because labeled data detailing full trajectories of successful interactions is scarce. Existing methods often rely only on the final outcome reward, neglecting valuable intermediate feedback, which can lead to suboptimal policies.

A new paper introduces QLASS (Q-guided Language Agent Stepwise Search), a framework designed to address this data-scarcity problem. Instead of relying solely on the final outcome reward, QLASS estimates intermediate Q-values, effectively creating “process rewards” that guide the agent during both training and inference. A Q-value is the expected cumulative future reward of taking a specific action in a specific state of a task.
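
Concretely, this is the standard reinforcement-learning notion of a Q-value; a hedged formulation (the paper's exact notation and choice of discount factor may differ) is:

```latex
% Q-value of taking action a_t in state s_t under policy \pi:
% the expected (possibly discounted) sum of future rewards until the episode ends at step T.
Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\!\left[\, \sum_{k=t}^{T} \gamma^{\,k-t}\, r_k \;\middle|\; s_t, a_t \right]
```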

Building an Intuition: Navigating a Webshop

Imagine an agent tasked with finding a specific item on a webshop. Traditional methods reward the agent only if it ultimately finds the item, which says little about how effective the intermediate steps were. An agent that stumbles upon the item through random search actions is rewarded exactly the same as one that searches efficiently and strategically.

QLASS approaches the task differently. It begins with a supervised fine-tuning (SFT) phase on a dataset of human-demonstrated successful searches. It then uses the fine-tuned agent to explore the webshop, building an “exploration tree” that maps out potential search trajectories, with each node representing an action taken and its associated observation.
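
To make the exploration-tree idea concrete, here is a minimal Python sketch of how such a tree could be built by sampling actions from the fine-tuned policy. The `Node` structure and the `sample_actions`/`step` helpers are illustrative placeholders, not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

@dataclass
class Node:
    """One node of the exploration tree: the action taken, the resulting
    observation, the immediate reward, and the children expanded from it."""
    action: Optional[str]          # None for the root (task instruction only)
    observation: str
    reward: float = 0.0
    q_value: float = 0.0
    children: List["Node"] = field(default_factory=list)

def expand(node: Node,
           sample_actions: Callable[[str], List[str]],
           step: Callable[[str, str], Tuple[str, float]],
           depth: int,
           branching: int = 3) -> None:
    """Recursively grow the tree by sampling a few candidate actions from the
    fine-tuned agent and executing each one in the environment (both helpers
    are hypothetical stand-ins for the agent and the webshop simulator)."""
    if depth == 0:
        return
    for action in sample_actions(node.observation)[:branching]:
        observation, reward = step(node.observation, action)
        child = Node(action=action, observation=observation, reward=reward)
        node.children.append(child)
        expand(child, sample_actions, step, depth - 1, branching)
```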

Crucially, QLASS estimates a Q-value for each node in the exploration tree using the Bellman equation, a standard tool in reinforcement learning. This assigns each action a value based on its contribution to the ultimate success of the task. For example, clicking a seemingly irrelevant link early on receives a lower Q-value than an action that leads directly to the desired item.
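
A hedged sketch of the Bellman-style backup that could assign these Q-values, working from the leaves of the tree toward the root (the paper's exact recursion, e.g. its discount factor or aggregation over children, may differ):

```latex
% Bellman-style recursion over the exploration tree:
% the value of an action is its immediate reward plus the (discounted)
% value of the best action available from the resulting state.
Q(s_t, a_t) = r_t + \gamma \max_{a'} Q(s_{t+1}, a')
```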

The Q-values are then used to train a Q-network (QNet) that predicts the expected cumulative reward for a given state and action. During inference, the agent uses the QNet to select the action with the highest predicted Q-value at each step, yielding more effective and efficient search strategies. The paper shows that QLASS consistently outperforms other methods even when trained with roughly half the annotated data.
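
At inference time, this amounts to scoring each candidate action with the trained QNet and taking the best one. A minimal sketch, assuming a hypothetical `propose_actions` sampler (the language agent) and `q_net` scorer (the trained Q-network):

```python
from typing import Callable, List

def q_guided_step(state: str,
                  propose_actions: Callable[[str], List[str]],
                  q_net: Callable[[str, str], float]) -> str:
    """Sample candidate actions from the agent, score each with the QNet,
    and return the action with the highest predicted Q-value."""
    candidates = propose_actions(state)
    return max(candidates, key=lambda action: q_net(state, action))
```

Greedy selection over a handful of sampled candidates keeps the inference budget small while still exploiting the stepwise signal.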

QLASS in Action: Beyond Webshops

The researchers evaluated QLASS on three environments: WebShop, ALFWorld (a simulated household environment), and SciWorld (a simulated scientific-experimentation environment). Across all three, QLASS delivered significant gains over baselines that rely only on the final reward or on other self-improvement techniques. A key finding is that QLASS improved inference-time decision-making with only a fraction of the annotated data required by alternative methods.

QLASS’s use of Q-values provides step-wise feedback that helps language agents prioritize effective actions, avoid repetitive or unproductive ones, and achieve better results within a constrained budget of inference-time steps. Qualitative analyses of agent trajectories further support the intuition that QLASS produces more coherent decision-making.

This work offers a meaningful advance in training effective language agents from limited supervisory signals. The methodology applies to a broad range of interactive tasks, paving the way for more robust and efficient AI agents operating in complex, real-world environments.