AI’s Mirror Is Too Polite: New Benchmark Reveals Why LLMs Struggle to Simulate Real Humans
For years, computer scientists have chased the “holy grail” of digital social science: a general-purpose user simulator. The idea is to create an AI “digital twin” that can predict how a person might react to a new app feature, which advertisement will catch their eye, or how they will respond to a frustrating customer service bot.
However, a new research paper titled “Towards Real-world Human Behavior Simulation” suggests that our current Large Language Models (LLMs) are still seeing the world through a “Utopian” lens. By introducing OmniBehavior, the first simulation benchmark built entirely on real-world data from the Kuaishou platform, researchers have exposed a significant gap between how AI thinks we behave and how we actually do.
The Xiaomi Causal Chain
To understand the complexity of human behavior, the researchers point to what they call “long-horizon, cross-scenario causal chains.”
Consider a typical user’s journey: On September 25, a user searches for a “Xiaomi Launch Event.” Over the next twelve days, they watch tech reviews in their video feed, click on a flagship store advertisement, and join a live stream to ask questions. Finally, on October 8, they add the phone to their cart and buy it.
The researchers found that 80% of such “conversion paths” span multiple scenarios and several days. Most existing AI benchmarks suffer from “tunnel vision,” looking only at a single session or a single app. When you isolate these moments, you lose the “causal integrity” of why a person does what they do.
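To make the idea concrete, here is a minimal Python sketch of what such a cross-scenario event log might look like, using the Xiaomi journey above. The paper does not publish OmniBehavior's data schema, so the `Event` fields and the intermediate dates below are purely illustrative; the point is that filtering to a single scenario, as a "tunnel vision" benchmark does, discards the purchase's upstream causes.

```python
from dataclasses import dataclass

# Hypothetical record layout -- the paper does not publish OmniBehavior's
# schema, so these field names and values are illustrative only.
@dataclass
class Event:
    day: str        # calendar day of the action
    scenario: str   # one of: search, video, ads, live, e-commerce
    action: str

# The Xiaomi journey as one long-horizon, cross-scenario chain.
# Intermediate dates are invented for illustration.
chain = [
    Event("Sep 25", "search",     "query: Xiaomi Launch Event"),
    Event("Sep 28", "video",      "watch tech review in feed"),
    Event("Oct 02", "ads",        "click flagship store ad"),
    Event("Oct 05", "live",       "ask question in live stream"),
    Event("Oct 08", "e-commerce", "add to cart and purchase"),
]

# A single-scenario benchmark sees only one slice of the chain,
# so the final purchase loses its upstream causes.
ecommerce_only = [e for e in chain if e.scenario == "e-commerce"]
print(f"{len(ecommerce_only)} of {len(chain)} events survive the cut")
```

Seen in isolation, the e-commerce slice is just "a user bought a phone"; the twelve days of searching, watching, and asking that explain the purchase are gone.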
The “Positivity-and-Average” Problem
The study evaluated state-of-the-art models, including Claude-4.5-Opus and GPT-5.2, across five scenarios: video browsing, live streaming, ads, e-commerce, and search. The results were humbling. Even the best-performing model, Claude-4.5-Opus, achieved an overall score of only 44.55 out of 100.
The researchers identified three structural biases that make LLMs poor mirrors of reality:
- Hyper-activity: LLMs are “too eager.” While real human behavior is sparse (we rarely “like” or “share” every video we see), LLMs overestimate the probability of these actions by 40–60% (see the sketch after this list).
- Persona Homogenization: Instead of simulating a diverse population, LLMs tend to converge toward a “positive average person.” They struggle to maintain the distinct, quirky identities that characterize real individuals, resulting in a “blurring” of simulated populations.
- Utopian Bias: This is perhaps the most glaring flaw in customer service simulations. When a package is weeks late, a real human might be irritable, direct, or even confrontational. In contrast, LLMs—trained to be helpful and harmless—remain unrealistically polite. They use “face-saving” language and hedging, failing to simulate the “adversarial interactions” that occur in the real world.
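Of these, the hyper-activity bias is the easiest to picture numerically. Below is a minimal sketch, not the paper's evaluation code, of how such a calibration gap could be measured: compare the rate at which a simulator emits a sparse action against the rate at which real users actually take it. The toy numbers are invented so the gap lands at +60%, the upper end of the range the authors report.

```python
# Minimal sketch of measuring hyper-activity bias -- not the paper's
# evaluation code. We compare simulated vs. real rates for a sparse
# binary action such as "like" (1 = acted, 0 = did not).

def overestimation(sim: list[int], real: list[int]) -> float:
    """Relative gap between the simulated and real action rates."""
    sim_rate = sum(sim) / len(sim)
    real_rate = sum(real) / len(real)
    return (sim_rate - real_rate) / real_rate

# Toy data: real users like ~5% of videos, the simulator likes ~8%.
real_actions = [1] * 5 + [0] * 95
sim_actions = [1] * 8 + [0] * 92

print(f"overestimation: {overestimation(sim_actions, real_actions):+.0%}")
# -> overestimation: +60%
```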
Why It Matters
The failure of LLMs to simulate the “long-tail” of human behavior—the grumpy customer, the silent lurker, or the impulsive buyer—means they cannot yet serve as reliable stand-ins for real people in industrial testing.
The OmniBehavior benchmark provides a much-needed reality check. As the researchers conclude, until we address these intrinsic distortions, AI models will remain “mirrors of our ideals rather than maps of our reality.” For those looking to build the next generation of personalized tech, the message is clear: the path to better AI isn’t just more data, but more human data—flaws and all.