AI Papers Reader

Personalized digests of latest AI research


Why Your AI Still Doesn’t “Get” You: New Benchmark Challenges AI to Get Personal

The quest to align Artificial Intelligence with human values has hit a persistent roadblock: humans don’t all value the same things. While today’s Large Language Models (LLMs) are excellent at being generally helpful and polite, they often struggle to cater to the specific, idiosyncratic needs of individual users.

In a new paper, researchers from the University of California, Davis, introduce Personalized RewardBench, a rigorous new framework designed to measure how well AI “reward models”—the internal judges used to train chatbots—actually understand personal preference. Their findings suggest that even the world’s most advanced AI systems are currently hitting a “personalization bottleneck.”

The Problem with “General” Goodness

To train an AI like ChatGPT, developers use a process called Reinforcement Learning from Human Feedback (RLHF). At its heart is a “reward model” (RM), a digital critic trained to predict which of two responses a human would prefer.
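The standard way such a critic is trained is with a pairwise preference objective (the Bradley-Terry formulation commonly used in RLHF). As a minimal sketch, assuming the reward model assigns a scalar score to each response, the training signal looks like this; the function names here are illustrative, not from the paper:

```python
import math

def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry model: probability that the 'chosen' response wins,
    given the reward model's scalar scores for the two responses."""
    return 1.0 / (1.0 + math.exp(reward_rejected - reward_chosen))

def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood the reward model is trained to minimize:
    low when the chosen response scores well above the rejected one."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))
```

The larger the score margin in favor of the human-preferred response, the closer the win probability gets to 1 and the smaller the loss; equal scores give exactly 0.5.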

Until now, these critics have been graded on their ability to spot universal qualities: Is the answer factually correct? Is it free of hate speech? Is it formatted well? But as the UC Davis team points out, a response can be perfectly “correct” and yet totally useless to a specific user.

Imagine a PhD student who is bored with their research. A generic AI might give excellent, high-quality advice on how to find a new topic. However, if that specific student has a history of seeking their supervisor’s guidance and values “open communication,” a response that ignores the supervisor relationship is a failure—no matter how well-written it is.

A New Yardstick for Personalization

Personalized RewardBench isolates personalization by constructing “chosen” and “rejected” response pairs where the only difference is adherence to a user’s personal history and specific rubrics.

Crucially, the researchers ensured the “rejected” answers weren’t “bad” in the traditional sense. In fact, human evaluators rated the chosen and rejected answers as nearly identical in factuality and helpfulness. The rejection was based solely on a violation of the user’s personal “rubric”—their unique set of constraints and preferences.
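Scoring a reward model on such a benchmark reduces to a simple question: across all pairs, how often does it score the user-aligned response higher? A minimal sketch of that accuracy computation, assuming a hypothetical `score_fn` that rates a response given the user profile and prompt (the data format here is illustrative, not the paper's):

```python
def benchmark_accuracy(pairs, score_fn) -> float:
    """Fraction of (profile, prompt, chosen, rejected) pairs where the
    reward model scores the user-aligned 'chosen' response strictly higher."""
    correct = sum(
        1
        for profile, prompt, chosen, rejected in pairs
        if score_fn(profile, prompt, chosen) > score_fn(profile, prompt, rejected)
    )
    return correct / len(pairs)

# Toy demonstration with a fake scorer that just rewards mentioning
# something from the user's profile keywords.
def toy_score_fn(profile, prompt, response):
    return sum(1 for keyword in profile if keyword in response)
```

Under this metric a model that ignores the user profile entirely cannot do better than chance, since both responses are equally good on general grounds.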

To help the AI models, the researchers introduced a “Planner” module. This tool analyzes a user’s past interactions and translates them into a clear set of instructions—essentially a “to-do list” for the reward model. For example, if a user’s history shows they prefer concise, technical summaries over friendly, conversational ones, the Planner explicitly tells the reward model to look for “technical depth” and “brevity.”
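In the paper the Planner is itself an LLM; the toy sketch below only illustrates the shape of that history-to-rubric translation step with hard-coded keyword rules (the rules and function name are assumptions for illustration, not the paper's implementation):

```python
def plan_rubric(interaction_history: list[str]) -> list[str]:
    """Toy 'Planner': scan a user's past interactions for preference
    signals and emit explicit criteria for the reward model to check."""
    rubric = []
    history = " ".join(interaction_history).lower()
    if "too long" in history or "shorter" in history:
        rubric.append("brevity: prefer concise answers")
    if "technical detail" in history or "jargon is fine" in history:
        rubric.append("technical depth: do not oversimplify")
    if "supervisor" in history or "advisor" in history:
        rubric.append("relationships: account for the user's advising relationship")
    return rubric
```

For the PhD-student example above, a history mentioning the supervisor would yield a rubric item the reward model can check explicitly, rather than hoping it infers the preference on its own.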

The Personalization Gap

The results were a wake-up call for the industry. Even “frontier” models like Gemini-3-Flash and GPT-5.1 struggled with the benchmark, with the best-performing model peaking at an accuracy of just 75.94%.

The study also debunked a common myth in AI development: that bigger is always better. The researchers found that simply increasing the number of parameters in a model didn’t consistently improve its ability to handle personalization. In some cases, smaller, more specialized models actually outperformed their massive counterparts.

Why It Matters

If a reward model can’t tell the difference between what a specific user wants and what a “general” user wants, the AI it trains will remain a generic tool rather than a personal assistant.

By providing a benchmark that more accurately predicts how an AI will perform in the real world, the UC Davis team has provided a roadmap for the next generation of “pluralistic” AI—systems that don’t just follow the rules of the crowd, but actually learn the language of the individual.