Forget-Me-Not: Giving AI Agents a 'Git-Like' Memory to Tackle Evolving Realities
Large language model (LLM) agents are incredibly capable when dropped into static, predictable environments. But when they enter the messy, shifting landscape of the real world—where software updates break old code and human users change their minds—these systems quickly fall apart.
The root of the problem is a phenomenon known as “state collapse.” When an AI agent learns something new, it typically overwrites its old memory bank with the latest state. It is the equivalent of a digital notepad that gets completely erased and rewritten every time an update occurs. This works fine until the agent needs to understand why a change happened, or has to deal with a system rollback.
To address this, a team of researchers from the National University of Singapore, MIT, and other leading institutions has introduced EvoArena, a new benchmark designed to test AI agents in continuously evolving environments. Along with it, they developed EvoMem, a lightweight, “Git-like” memory system that tracks changes as structured history patches.
A Moving Target
Consider a virtual personal assistant helping a user with grocery shopping. Initially, the user tells the assistant, “I want simple, predictable staples from mainstream supermarkets.” Later, they decide to experiment, telling the AI, “I want to branch out and try sumac, tahini, and chili crisp from the international market.”
Under a standard memory model, the AI simply updates its record to the new preference. However, if asked to plan a quick, low-energy weekday meal, the agent might fail to realize that the user’s craving for complex international ingredients was specifically for weekend cooking projects. Because the AI overwrote its memory, it lost the context of the original preference.
To evaluate this kind of memory erosion, EvoArena tests agents across three dynamic domains: Terminal-Bench-Evo (changing computer terminal workflows), SWE-Chain-Evo (evolving software codebases), and PersonaMem-Evo (shifting human preferences).
In the terminal workflow test, an agent might be tasked with deploying a webpage. In the initial “Prototype” stage, it simply copies files manually. In “Version 1,” the workflow shifts to an automated git hook. By “Version 3,” strict security settings are introduced, requiring deployments to use specific group permissions. Without a way to track these sequential updates, standard AI agents fail, blindly trying obsolete methods because they cannot untangle what changed from what still holds true.
The “Git” Solution for AI
EvoMem addresses this by borrowing from software version control systems like Git. Instead of erasing old memories, it appends a “patch” to a historical log. Each patch records:
- The old memory state.
- The new memory state.
- The rationale for why the change occurred.
- The triggering evidence.
When prompted, the agent defaults to its latest memory. But if it encounters a conflict or version-specific question, it can selectively retrieve historical patches to understand the evolution of its environment.
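The patch fields listed above, together with the "latest by default, history on demand" behavior, can be sketched in a few lines of Python. This is a minimal illustration, not EvoMem's actual implementation: the names `Patch`, `PatchMemory`, `update`, `latest`, and `history` are hypothetical, and the sample entries are paraphrased from the terminal deployment example earlier in the article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Patch:
    """One entry in the history log (hypothetical structure)."""
    key: str              # which piece of memory changed
    old: Optional[str]    # the old memory state (None for the first entry)
    new: str              # the new memory state
    rationale: str        # why the change occurred
    evidence: str         # what triggered the change


class PatchMemory:
    """Append-only memory: the latest state plus a log of patches."""

    def __init__(self) -> None:
        self._latest: dict[str, str] = {}
        self._log: list[Patch] = []

    def update(self, key: str, new: str, rationale: str, evidence: str) -> Patch:
        # Record the change as a patch instead of silently overwriting.
        patch = Patch(key, self._latest.get(key), new, rationale, evidence)
        self._log.append(patch)
        self._latest[key] = new
        return patch

    def latest(self, key: str) -> Optional[str]:
        # Default path: answer from the most recent state.
        return self._latest.get(key)

    def history(self, key: str) -> list[Patch]:
        # Selective retrieval: only the patches that touched this key.
        return [p for p in self._log if p.key == key]


memory = PatchMemory()
memory.update("deploy", "copy files manually",
              "initial prototype workflow", "Prototype task description")
memory.update("deploy", "push to trigger a git hook",
              "workflow was automated", "Version 1 task description")

print(memory.latest("deploy"))
for patch in memory.history("deploy"):
    print(patch.old, "->", patch.new, "|", patch.rationale)
```

Keeping the latest state in its own dictionary means ordinary lookups stay cheap, while the log is consulted only when a conflict or a version-specific question makes the history relevant.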
The results are striking. On EvoArena, standard agents struggled, averaging just 39.6% accuracy across evolving domains. Adding EvoMem consistently boosted performance, improving “chain-level” accuracy—the ability to successfully solve an entire sequence of related, changing tasks—by 3.7%. It also improved performance on standard benchmarks like GAIA and LoCoMo by up to 6.1%.
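To see why chain-level accuracy is a stricter measure than per-task accuracy, consider the small sketch below. It assumes a chain counts as solved only when every task in it is solved, which is one reading of the description above rather than the benchmark's confirmed scoring rule. The function names and the outcome data are invented for illustration and are not results from the paper.

```python
def task_accuracy(chains: list[list[bool]]) -> float:
    """Fraction of individual tasks solved, ignoring chain boundaries."""
    results = [ok for chain in chains for ok in chain]
    return sum(results) / len(results)


def chain_accuracy(chains: list[list[bool]]) -> float:
    """Fraction of chains in which every task was solved (assumed rule)."""
    return sum(all(chain) for chain in chains) / len(chains)


# Invented outcomes: four chains of three evolving tasks each.
chains = [
    [True, True, True],
    [True, False, True],
    [True, True, False],
    [True, True, True],
]

print(f"task-level:  {task_accuracy(chains):.3f}")
print(f"chain-level: {chain_accuracy(chains):.3f}")
```

In this made-up data the agent solves 10 of 12 tasks but completes only 2 of 4 chains, so a single stale assumption anywhere in a sequence is enough to sink the whole chain.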
Ultimately, the research suggests that for AI agents to survive in a constantly shifting world, they cannot simply live in the present. To be truly reliable, they must be equipped to remember their own history.