The Clarification Catalyst: How InfoPO Teaches AI Agents to Ask the Right Questions
Large Language Models (LLMs) are increasingly being deployed as “agents”—AI assistants capable of booking flights, writing code, or troubleshooting technical issues. However, these agents face a persistent “mind-reading” problem: human requests are often frustratingly vague. If you tell an AI to “book a flight next week,” it cannot act until it knows your budget, destination, and preferred departure airport.
While developers use Reinforcement Learning (RL) to train these agents, current methods often struggle with “credit assignment.” If an agent has a ten-turn conversation and ultimately fails, the training algorithm often penalizes every single turn equally, even if the agent asked brilliant clarifying questions early on.
To bridge this gap, a team of researchers from Peking University and other institutions has introduced InfoPO (Information-Driven Policy Optimization). This new framework, detailed in a paper recently released on arXiv, provides a principled way to reward AI agents for “uncertainty reduction”—the act of asking the right questions at the right time.
Rewarding the “Aha!” Moment
The core innovation of InfoPO is a turn-level reward based on “counterfactual information gain.” To understand this, imagine an agent helping a user with a coding project.
In the factual world, the user tells the agent, “The data is stored in a nested dictionary.” The agent then realizes it needs to use a specific type of recursive loop.
In the counterfactual world, InfoPO’s training system “masks” that specific piece of feedback, replacing it with a placeholder like “No information found.” It then asks: How much would the agent’s next action change if it hadn’t heard that the data was a nested dictionary?
If the difference between the two scenarios is large, it means the user’s feedback was highly informative, and the agent is rewarded for the action that elicited it. This “turn-level” granularity allows the agent to learn that asking “What is the data structure?” is a high-value move, even if the final code happens to have a bug later.
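The paper's exact formulation isn't reproduced here, but the idea can be sketched in a few lines: compare the agent's next-action distribution with the real feedback against the distribution with the feedback masked out, and use the divergence between them as the turn-level reward. The `toy_policy`, action set, and use of KL divergence below are illustrative assumptions, not InfoPO's actual implementation.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def counterfactual_info_gain(policy, context, feedback, mask="No information found."):
    """Turn-level reward sketch: how much does the agent's next-action
    distribution shift when the user's feedback is masked out?"""
    p_factual = policy(context + [feedback])        # world where feedback was heard
    p_counterfactual = policy(context + [mask])     # world where it was masked
    return kl_divergence(p_factual, p_counterfactual)

# Hypothetical policy: hearing "nested dictionary" makes the recursive
# approach far more likely. Actions: [recursive loop, flat loop, ask again].
def toy_policy(messages):
    if any("nested dictionary" in m for m in messages):
        return [0.9, 0.05, 0.05]
    return [0.2, 0.4, 0.4]

reward = counterfactual_info_gain(
    toy_policy,
    context=["What is the data structure?"],
    feedback="The data is stored in a nested dictionary.",
)
print(round(reward, 3))  # a large divergence -> the question was high-value
```

A near-zero divergence would mean the feedback barely changed the agent's plan, so the question that elicited it earns little reward.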
Balancing Curiosity and Execution
Asking questions is only half the battle; the agent must eventually finish the task. InfoPO handles this through an “adaptive variance-gated fusion.”
In the early stages of a task, when the agent is confused and the final success signal is “sparse” (meaning the agent is failing most of its attempts), InfoPO cranks up the reward for gathering information. This prevents the agent from “stagnating” or repeating the same failed actions. As the agent becomes more successful, the system automatically shifts its focus toward the final goal, ensuring the agent doesn’t become a “professional interviewer” who never actually gets the job done.
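A minimal sketch of that shifting balance, under stated assumptions: the gating statistic below (one minus the recent success rate) and the linear blend are illustrative stand-ins, not the paper's exact variance-gated formula.

```python
def fused_reward(info_reward, task_reward, recent_successes):
    """Blend the information reward with the final-task reward, gated by how
    often the agent has been succeeding lately (1 = solved, 0 = failed)."""
    p = sum(recent_successes) / len(recent_successes)   # recent success rate
    # Illustrative gate: near 1 while the agent is failing most attempts
    # (favor information gathering), near 0 once it succeeds reliably.
    gate = 1.0 - p
    return gate * info_reward + (1.0 - gate) * task_reward

# Early training: sparse success -> the information reward dominates.
early = fused_reward(info_reward=0.8, task_reward=0.0, recent_successes=[0, 0, 0, 1])
# Later training: frequent success -> the final-goal reward dominates.
late = fused_reward(info_reward=0.8, task_reward=1.0, recent_successes=[1, 1, 1, 0])
print(round(early, 2), round(late, 2))
```

The design intent is the one the article describes: the same information reward that drives early exploration fades automatically once the success signal is no longer sparse, so the agent stops interviewing and starts finishing.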
Proven Success in Complex Arenas
The researchers tested InfoPO across three diverse benchmarks:
- UserGym: A suite for travel planning and goal inference.
- ColBench: A collaborative programming environment.
- τ²-Bench: A long-horizon troubleshooting task for airlines and retail.
Across these tasks, InfoPO consistently outperformed existing RL baselines by 14% to 16%. Qualitatively, the researchers observed an “explore-then-consolidate” pattern. Trained agents became more proactive, resolving ambiguity early in the conversation before committing to a final answer.
By treating information as a measurable resource, InfoPO moves AI agents away from “best-guess” responses and toward a more human-like, collaborative style of problem-solving.