AI Papers Reader

Personalized digests of latest AI research

When AI Remembers Too Much: New Benchmark Reveals LLMs Struggle to Read the Room

Imagine you’ve told your AI assistant that you enjoy a bit of dry sarcasm, love using emojis, and prefer to be addressed by your nickname, “Joker.” For your daily casual chats, this works perfectly. But when you ask that same AI to draft a formal letter to the IRS to resolve a tax discrepancy, you probably don’t want the message to start with: “Hey there, Financial Wizard! Hope you’ve got your golden star stickers ready because today’s lesson is all about fixing that little tax ‘oopsie’!”

This phenomenon—where an AI’s memory of your personality clashes with the social requirements of a task—is the focus of a new research paper titled “BenchPreS.” Researchers from Yonsei University and LG AI Research have developed a benchmark to evaluate how well Large Language Models (LLMs) can selectively apply or suppress user preferences stored in their persistent memory.

The results? Even the world’s most advanced AI models are remarkably bad at “reading the room.”

The Challenge of Selective Memory

As AI assistants move toward having “persistent memory”—the ability to remember facts and styles across different conversations—the goal is deep personalization. However, the researchers argue that true intelligence isn’t just about remembering everything; it’s about knowing when to forget.

The BenchPreS benchmark tests models across 39 different scenarios in domains like finance, law, and health. It measures two key metrics: the Appropriate Application Rate (AAR), or how often the AI uses a preference when it should, and the Misapplication Rate (MR), or how often it uses a preference when it is socially or professionally inappropriate.
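The two metrics can be pictured as complementary rates over the benchmark's labeled cases. As a rough sketch (the paper's exact scoring procedure is not given in this summary; the case structure and field names below are hypothetical), each test case records whether the preference *should* apply in that context and whether the model *did* apply it:

```python
def aar(cases):
    """Appropriate Application Rate (sketch): of the cases where the
    preference should be used, the fraction where the model used it."""
    relevant = [c for c in cases if c["should_apply"]]
    return sum(c["did_apply"] for c in relevant) / len(relevant)

def mr(cases):
    """Misapplication Rate (sketch): of the cases where the preference
    is inappropriate, the fraction where the model used it anyway."""
    relevant = [c for c in cases if not c["should_apply"]]
    return sum(c["did_apply"] for c in relevant) / len(relevant)

# Illustrative, made-up outcomes:
cases = [
    {"should_apply": True,  "did_apply": True},   # casual chat: emojis fit
    {"should_apply": True,  "did_apply": False},  # missed personalization
    {"should_apply": False, "did_apply": True},   # emojis in an IRS letter
    {"should_apply": False, "did_apply": False},  # correctly suppressed
]
print(aar(cases))  # 0.5
print(mr(cases))   # 0.5
```

An ideal assistant would drive AAR toward 1.0 and MR toward 0.0; the paper's finding is that current models keep both rates high, applying preferences indiscriminately.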

Concrete Failures in “Social Intelligence”

The study provides several striking examples of current AI failures. In one instance, a model was tasked with writing to a bank loan officer. Despite the formal context, the AI used the user’s preferred nickname, “Rambo,” and adopted a “comedian perspective,” describing a rental history as looking “as empty as a salad bar at a donut convention.”

The researchers found a consistent trend: as models get better at following instructions, they actually get worse at context-aware selectivity. Instead of treating a user’s preference for emojis or sarcasm as a “hint” to be used when appropriate, models tend to treat them as “globally enforceable rules.” If the memory says “use bold text,” the AI uses bold text everywhere—from a grocery list to a legal document.

The “Thinking” Trap

Perhaps the most surprising finding involves the new wave of “reasoning” models. One might assume that a model that “thinks” before it speaks would realize a sarcastic tone is a bad idea for a court filing. Instead, the researchers found that reasoning often makes the problem worse.

In failure cases, the models’ internal thought traces showed them treating the user’s preferences like a mandatory checklist. One model specifically noted that the “school newsletter format” was inappropriate for a government document, but then proceeded to use it anyway because it viewed the user’s preference as a “key requirement” that had to be satisfied.

Moving Toward Context-Aware AI

The paper concludes that current training paradigms prioritize “preference adherence” at any cost. For AI models to become truly useful agents in professional settings, they must move beyond mere mimicry of user style. They need to develop a sense of “contextual integrity”—the ability to weigh a user’s personality against the norms of the world.

Until then, users might want to be careful what they ask their AI to remember; otherwise, “Joker” might just show up to their next mortgage application.