AI Papers Reader

Personalized digests of latest AI research


Can AI Keep Up with a Changing World? New Benchmark Exposes "Knowledge Lag" in Top Models

Imagine telling your AI assistant you’ve moved your coffee date from 2 PM to 3 PM, then later mentioning that the 3 PM slot is now occupied by a doctor’s visit. Does the AI still think the coffee date is happening? Does it get confused juggling the overlapping updates?

This ability to update information on the fly—known as “online adaptation”—is the focus of a new study by researchers from KAIST, Google, and Adobe. They argue that while Large Language Models (LLMs) are excellent at reciting static facts, they struggle significantly with “continual knowledge streams,” where information evolves or emerges incrementally. To address this gap, the team introduced OAKS (Online Adaptation to Continual Knowledge Streams), the first benchmark specifically designed to evaluate how AI tracks fine-grained, evolving facts over time.

The Challenge of Moving Targets

Most AI benchmarks are “static”: they hand a model a fixed pile of text and ask questions once. OAKS is different; it functions like a news ticker. Information is revealed in “chunks,” and the model is asked the same set of questions at every single interval.
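The streaming protocol can be sketched in a few lines. This is a hypothetical illustration, not the paper's harness: `stream_eval` and the toy keyword-matching "model" are made-up names, standing in for the real benchmark loop and an actual LLM call.

```python
from typing import Callable, List

def stream_eval(chunks: List[str],
                questions: List[str],
                model: Callable[[str, str], str]) -> List[List[str]]:
    """After each newly revealed chunk, re-ask every question against the
    full context seen so far, recording the answers per interval."""
    answers_per_interval = []
    context = ""
    for chunk in chunks:
        context += chunk + "\n"  # the knowledge stream grows over time
        answers_per_interval.append([model(context, q) for q in questions])
    return answers_per_interval

# Toy stand-in "model": echoes the last line mentioning the question's keyword.
def toy_model(context: str, question: str) -> str:
    keyword = question.split()[-1].strip("?")
    hits = [line for line in context.splitlines() if keyword in line]
    return hits[-1] if hits else "unknown"

chunks = ["The meeting is at 2 PM.", "Update: the meeting moved to 3 PM."]
log = stream_eval(chunks, ["When is the meeting?"], toy_model)
# log[0][0] reflects the original time; log[1][0] reflects the update.
```

The key contrast with static benchmarks is that the same question is scored at every interval, so a model can be right early and wrong later (or vice versa).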

To build intuition for this, consider the researchers’ OAKS-BABI dataset. In a story, a character might place ten soldier figures on a dining table. A few pages later, one is smashed. Later still, four are moved to a shelf. If the model is asked at every step, “How many soldiers are on the table?”, it must not only find the numbers but also realize that the new information supersedes the old. It’s not just about memory; it’s about “state tracking”—keeping an accurate internal scoreboard of the world as it changes.
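The "internal scoreboard" idea can be made concrete with a tiny sketch (event names here are illustrative, not from the paper): a correct answer requires replaying the event stream and letting each event supersede earlier facts, rather than just finding a number in the text.

```python
# Minimal state-tracking sketch of the soldier example: maintain a running
# count and apply each event in order.
def soldiers_on_table(events):
    table = 0
    for action, n in events:
        if action == "place":
            table += n
        elif action in ("smash", "move_to_shelf"):
            table -= n  # either way, the figure leaves the table
    return table

events = [("place", 10), ("smash", 1), ("move_to_shelf", 4)]
print(soldiers_on_table(events))  # 10 - 1 - 4 = 5
```

A model answering from surface text alone might latch onto the "10" or the "4"; only tracking the cumulative state yields 5.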

The researchers also developed OAKS-Novel, which uses human-curated literary texts. This tests more complex reasoning, such as tracking a character’s changing motivations or locations across the long, winding plot of a novel.

Inertia and Volatility

The study put 14 state-of-the-art models, including Google’s Gemini 3 and the open-source Qwen3, to the test. The results were sobering. On the synthetic tracking tasks, average accuracy was a mere 33.3%. Even the most powerful models struggled when facts updated frequently.

The team identified two primary ways AI fails to keep up:

  • Under-updating (Inertia): Some models are “stubborn.” They show a lag, sticking to an outdated answer even after the text explicitly provides an update.
  • Over-updating (Volatility): Other models are too sensitive. They get “distracted” by irrelevant details in the new text and change a correct answer to an incorrect one, even when the underlying fact hasn’t changed.
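These two failure modes can be detected mechanically by comparing the model's answer stream against the gold answer stream. The counting scheme below is an illustrative simplification (the paper's actual metrics may differ): inertia is keeping the stale answer after the fact changed; volatility is abandoning a correct answer when the fact did not change.

```python
# Classify per-interval errors as under-updating (inertia) or
# over-updating (volatility). `gold[t]` and `pred[t]` are the correct and
# predicted answers at interval t.
def failure_modes(gold, pred):
    under = over = 0
    for t in range(1, len(gold)):
        if gold[t] != gold[t - 1] and pred[t] == gold[t - 1]:
            under += 1  # fact changed, but the model kept the old answer
        elif (gold[t] == gold[t - 1]
              and pred[t - 1] == gold[t - 1]
              and pred[t] != gold[t]):
            over += 1   # fact unchanged, but the model dropped a correct answer
    return under, over

gold = ["2 PM", "3 PM", "3 PM", "3 PM", "canceled"]
pred = ["2 PM", "2 PM", "3 PM", "1 PM", "canceled"]
print(failure_modes(gold, pred))  # (1, 1): one inertia error, one volatility error
```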

Why “Thinking” Isn’t Enough

The researchers also explored whether popular techniques could fix these issues. They found that “Thinking Mode”—where models like Gemini or Qwen3-Thinking show their internal reasoning process—improved multi-hop logic but didn’t fully solve the tracking problem.

Similarly, Retrieval-Augmented Generation (RAG)—a technique where the AI “looks up” relevant snippets of text—showed limitations. When a fact changes multiple times, RAG often retrieves both the old and the new information. The AI then struggles to distinguish which is the “current” truth and which is historical baggage.
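The RAG failure is easy to reproduce with a toy retriever (all names here are made up): keyword matching over a knowledge stream returns every chunk mentioning the entity, so stale and current facts arrive together and the generator must guess which is "now." Sorting retrieved chunks by stream position is one cheap recency heuristic, though it assumes later always means truer.

```python
# Toy keyword retriever over a chunked knowledge stream.
def retrieve(stream, query_terms):
    """Return (position, chunk) pairs for chunks sharing a term with the query."""
    return [(i, c) for i, c in enumerate(stream)
            if any(t in c for t in query_terms)]

stream = [
    "The coffee date is at 2 PM.",
    "The weather is sunny.",
    "Update: the coffee date moved to 3 PM.",
]
hits = retrieve(stream, ["coffee"])
# Both the 2 PM and 3 PM chunks come back -- the retriever has no notion
# of which fact is current.

latest = max(hits)[1]  # recency heuristic: trust the latest mention
print(latest)
```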

As AI moves toward becoming “agentic”—operating as personal assistants or robots in dynamic environments—the OAKS benchmark highlights a critical hurdle. For an AI to be truly useful in the real world, it needs to do more than just remember; it needs to know when to forget.