AI Papers Reader

AI’s Hoarding Problem: Why Large Language Models Struggle with Evolving Memories

Imagine telling a personal AI assistant in January that you love coffee. In March, you cut back on caffeine and switch to green tea. By September, you allow yourself decaf espresso. If you later ask that assistant, “What was my preferred drink right before I switched to green tea?” a human friend would easily recall “coffee.”

For today’s state-of-the-art AI systems, however, this simple step back in time is surprisingly hard to get right.

The underlying problem is what psychologists call memory “interference,” a phenomenon in which old and new memories clash and disrupt retrieval. To measure it, a research team from UNC Chapel Hill and the University of Texas at Austin has introduced MINTEVAL (Long-Horizon Memory under INTerference Evaluation), a new benchmark designed to stress-test how well AI systems handle messy, ever-changing digital environments.

The findings are a wake-up call for the AI industry: when information is dynamic, even our most advanced systems fall short.

Testing AI in a Changing World

Existing benchmarks typically evaluate AI memory using static, independent facts. MINTEVAL shifts the playing field to realistic, long-horizon scenarios where information is continually updated, revised, or contradicted.

The benchmark spans 15.6k question-answering pairs across four practical domains:

  • State Tracking: Sequential, symbolic changes to simple facts.
  • Multi-Turn Dialogue: Tracking evolving user preferences over months of chat history.
  • Wikipedia Revisions: Chronicling how factual articles are edited, corrected, and updated over time.
  • GitHub Commits: Tracking complex code modifications across chronological software updates.

The contexts involved are massive, averaging 138.8k tokens (roughly the length of a 400-page novel) and scaling up to 1.8 million tokens.
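The novel comparison can be sanity-checked with a back-of-the-envelope calculation. The conversion factors below are assumptions rather than figures from the paper: about 0.75 English words per token (a common rule of thumb) and about 260 words per printed page.

```python
# Rough token-to-page conversion. Both ratios are assumed rules of thumb,
# not values reported by the MINTEVAL authors.
WORDS_PER_TOKEN = 0.75   # assumption: typical for English text
WORDS_PER_PAGE = 260     # assumption: typical paperback page


def pages_for(tokens: int) -> float:
    """Estimate how many printed pages a given token count would fill."""
    return tokens * WORDS_PER_TOKEN / WORDS_PER_PAGE


average_pages = pages_for(138_800)    # average context length in the article
largest_pages = pages_for(1_800_000)  # largest context length in the article

print(f"average context: about {average_pages:.0f} pages")
print(f"largest context: about {largest_pages:.0f} pages")
```

Under these assumptions the average context comes out near 400 pages, in line with the comparison above, while the largest contexts would run to several thousand pages.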

To navigate these histories, AI agents must answer two types of questions. The first is single-target recall, such as “lookback” queries (e.g., “What was the building’s floor count in the version of the article two edits prior?”). The second is multi-target aggregation, which requires processing multiple pieces of evidence (e.g., “How many different producers have been listed for this album across all revisions?”).
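To make the two question types concrete, here is a minimal sketch that applies both to the coffee example from the introduction. The `Revision` record and the two helper functions are hypothetical names chosen for illustration; they are not part of MINTEVAL, and the benchmark poses these questions in natural language over much longer histories.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    """One version of a tracked fact (hypothetical structure)."""
    step: int    # position in the timeline, e.g. a month or an edit number
    value: str   # the value of the fact at that point


def lookback(history, anchor_value, distance=1):
    """Single-target recall: the value `distance` revisions before the
    first revision whose value equals `anchor_value`, or None."""
    ordered = sorted(history, key=lambda rev: rev.step)
    for index, rev in enumerate(ordered):
        if rev.value == anchor_value:
            if index - distance < 0:
                return None  # nothing that far back in the history
            return ordered[index - distance].value
    return None  # the anchor value never appears


def count_distinct(history):
    """Multi-target aggregation: how many different values ever appeared."""
    return len({rev.value for rev in history})


# January: coffee, March: green tea, September: decaf espresso.
drinks = [
    Revision(1, "coffee"),
    Revision(3, "green tea"),
    Revision(9, "decaf espresso"),
]

print(lookback(drinks, "green tea"))  # coffee
print(count_distinct(drinks))         # 3
```

With a clean, ordered history both questions are trivial. The hard part for the evaluated systems is that no such tidy list is handed to them; they have to build and maintain it from a long, noisy context.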

The Digital Hoarder Bottleneck

The researchers evaluated seven prominent AI setups, including vanilla long-context LLMs, retrieval-augmented generation (RAG) systems, and specialized memory-augmented agents.

The results were sobering: the average accuracy across all evaluated systems was just 27.9%. Even MemAgent, the top-performing system, managed only 33.4% accuracy.

Crucially, the study revealed that the breakdown does not happen during the final reasoning stage. Instead, the primary bottleneck is “memory construction.” AI memory managers simply failed to retrieve or preserve the correct historical evidence in 41.7% of cases.

Furthermore, the researchers discovered that modern AI agents behave like digital hoarders. When analyzing how these systems manage their memory databases, the study found they are heavily biased toward “insertion”—constantly piling new facts onto the stack. They rarely “modify” existing entries and almost never “delete” outdated information. As obsolete facts accumulate, they interfere with the AI’s retrieval systems, causing performance to plummet as lookback distances increase.
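The difference between hoarding and curating can be illustrated with a toy memory store. This is only a sketch under simplifying assumptions: the `MemoryStore` class and its `insert`, `modify`, `delete`, and `retrieve` methods are hypothetical, and the agents in the study manage free-text memories rather than key-value pairs.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Toy key-value memory (hypothetical, for illustration only)."""
    entries: list = field(default_factory=list)  # (key, value), oldest first

    def insert(self, key, value):
        """Append a new entry without touching older ones."""
        self.entries.append((key, value))

    def modify(self, key, value):
        """Overwrite the most recent entry for `key`, or insert if absent."""
        for index in range(len(self.entries) - 1, -1, -1):
            if self.entries[index][0] == key:
                self.entries[index] = (key, value)
                return
        self.entries.append((key, value))

    def delete(self, key):
        """Drop every entry stored under `key`."""
        self.entries = [(k, v) for k, v in self.entries if k != key]

    def retrieve(self, key):
        """Return every stored value for `key`, oldest first."""
        return [v for k, v in self.entries if k == key]


updates = [("drink", "coffee"), ("drink", "green tea"), ("drink", "decaf espresso")]

hoarder = MemoryStore()   # insertion-only policy
for key, value in updates:
    hoarder.insert(key, value)

curated = MemoryStore()   # overwrites the stale value on each update
for key, value in updates:
    curated.modify(key, value)

print(hoarder.retrieve("drink"))  # ['coffee', 'green tea', 'decaf espresso']
print(curated.retrieve("drink"))  # ['decaf espresso']
```

The insertion-only store returns three competing answers to "what does the user drink now?", which is the kind of clutter the article describes. Plain overwriting is not a complete fix either, since it discards the older values that lookback questions depend on; a practical design would presumably need to keep history while marking which entry is current.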

For AI agents to successfully assist humans in long-term projects like software development or lifelong personal coaching, developers must build smarter mental closets. Future AI systems cannot just be given larger filing cabinets; they must learn how to clean them.