AI Papers Reader

Personalized digests of latest AI research

Your AI Assistant Still Doesn't "Get" You—But This New Benchmark Might Change That

Imagine asking your future mobile AI assistant to simply “order lunch.” A truly helpful assistant would already know you’re allergic to peanuts, prefer the faster delivery app over the cheaper one, and always want a sugar-free cola. It might even proactively silence your 7:00 AM alarm on a holiday without being asked.

Today’s AI agents, however, are far from that level of intuition. While they have become remarkably good at following explicit instructions like “Open the Calendar app,” they often stumble when faced with the ambiguity of real human life. To bridge this gap, researchers from Zhejiang University, in collaboration with Apple and Tencent, have introduced KnowU-Bench, a new framework designed to test if AI agents can truly understand, interact with, and anticipate a user’s needs.

Beyond “Point and Click”

Most existing benchmarks for AI agents focus on “GUI navigation”—essentially, can the AI find and click the right buttons? If the instruction is clear, many modern models excel. But KnowU-Bench, which operates within a live Android emulation environment, introduces three difficult new dimensions: personalization, interaction, and proactivity.

To build intuition for these ideas, consider the “canned cola” example featured in the paper. In a general task, an agent is told exactly what to buy and where. In a personalized task, the instruction is vague: “I want a case of canned cola.” The agent must then dig through the user’s simulated history to realize they prefer a specific brand from a specific store.

If the history is still unclear, the agent must perform interactive preference acquisition. Instead of guessing blindly, it should ask: “Do you want the 24-pack of Diet Coke or the 12-pack of Zero Sugar?” Finally, proactivity tests whether the agent knows when to intervene—like blocking a suspicious spam text—and, crucially, when to stay silent if no routine applies.
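The personalized-task flow described above can be sketched in a few lines: consult the user's history first, act directly only when it yields an unambiguous answer, and otherwise fall back to a clarifying question. This is a minimal illustration, not the paper's implementation; all names (`UserHistory`, `resolve_preference`, the `ask_user` callback) are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class UserHistory:
    # Past purchases as (item description, store) pairs, e.g.
    # ("Sugar-Free Cola 24-pack", "FastShip").
    purchases: list = field(default_factory=list)

def resolve_preference(item: str, history: UserHistory, ask_user):
    """Resolve a vague request like 'a case of canned cola'."""
    # 1) Try to infer the preference from past behavior.
    matches = [p for p in history.purchases if item.lower() in p[0].lower()]
    if len(matches) == 1:
        return matches[0]  # unambiguous: act without bothering the user
    # 2) History is empty or ambiguous: interactive preference acquisition.
    if matches:
        options = ", ".join(p[0] for p in matches)
        answer = ask_user(f"You've bought {options} before. Which one?")
    else:
        answer = ask_user(f"Which brand and size of {item} would you like?")
    return (answer, None)
```

The key behavior is that the clarifying question is a last resort: the agent only interrupts the user when the history genuinely fails to disambiguate.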

The “User Simulator” Strategy

One of the most innovative features of KnowU-Bench is its LLM-driven user simulator. Unlike older benchmarks that rely on static files, KnowU-Bench pairs the AI agent with a simulated "human" that holds a secret profile. This profile contains the user's habits, social circles, and even "pain points," such as a dislike for unnecessary phone calls.

When an agent is unsure of what to do, it can “ask” the user simulator for clarification. This creates a realistic dialogue where the AI must prove it can translate human feedback into technical actions within the phone’s interface.
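To make the dialogue idea concrete, here is a deliberately simple, rule-based stand-in for the LLM-driven simulator: it holds a hidden profile and answers only the specific question asked, never volunteering the whole profile at once. The class name, profile fields, and matching logic are illustrative assumptions, not details from the paper.

```python
class UserSimulator:
    """Toy stand-in for an LLM-driven user simulator with a hidden profile."""

    def __init__(self, profile: dict):
        self._profile = profile  # secret: never exposed wholesale to the agent

    def answer(self, question: str) -> str:
        # Reveal only the profile field the question actually touches,
        # like a real user replying to one clarification request.
        q = question.lower()
        for key, value in self._profile.items():
            if key in q:
                return f"My {key} is {value}."
        return "I'm not sure what you mean."

sim = UserSimulator({"allergy": "peanuts", "delivery app": "FastShip"})
```

In the real benchmark an LLM plays this role, so answers are free-form rather than templated, but the contract is the same: the agent must extract the preference from conversational feedback and translate it into on-screen actions.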

A Striking Reality Check

The researchers tested 11 state-of-the-art models, including frontier models like Claude Sonnet 4.6 and Gemini 3.1 Pro. The results revealed a significant “intelligence gap.” While the best models achieved nearly 100% success on simple, explicit tasks, their performance plummeted below 50% when instructions were vague or required proactive decision-making.

The study found that for a top-tier model like Claude, a staggering 93.8% of its personalized task failures were due to “insufficient clarification.” Essentially, the AI struggled to ask the right questions or failed to connect the user’s past behavior to the current request. In proactive scenarios, agents were more likely to “over-act”—launching tasks the user didn’t want—than to remain helpfully silent.

The takeaway for the tech industry is clear: the bottleneck for AI assistants is no longer the ability to navigate a screen. The next frontier is “intervention calibration”—the delicate art of knowing exactly when, how, and why to help a human. KnowU-Bench provides the roadmap for that journey.