Your Computer is a Mess, and Your AI Assistant Can’t Handle It Yet
We have all been there: staring at a digital “haystack” of folders, trying to remember where we saved that one specific photo or which PDF contains the instructions for a visa application. While we’ve been promised AI assistants that can manage our digital lives, a new research paper from Nanyang Technological University and Synvo AI suggests that even the world’s most advanced AI models are currently failing the “personal computer” test.
The researchers have introduced HippoCamp, a massive new benchmark designed to see if AI agents can actually navigate the idiosyncratic, messy, and multimodal world of a personal file system. Named after the hippocampus—the part of the brain essential for memory—the benchmark moves away from generic web-searching tasks and into the deep, dark hierarchies of local hard drives.
The Digital Haystack
To build HippoCamp, the team constructed three “archetypal” digital environments based on real-world data, totaling 42.4 GB across over 2,000 files. These aren’t just text files; they include everything from calendar invites (.ics) and emails (.eml) to voice memos (.mp3), spreadsheets (.xlsx), and high-definition videos (.mp4).
The benchmark tests agents on two main fronts: factual retention and profiling.
Building Intuition: The “Visa Photo” Problem
To understand “factual retention,” imagine asking an AI: “Find a photo in my folders that meets the official Japanese visa requirements.”
To solve this, the AI can’t just search for the word “photo.” It must first locate and read a PDF titled “Photograph Standard.pdf” to learn the specific constraints (e.g., 45 mm × 45 mm, white background, frontal view). Then, it must “look” at various JPEGs in an “Identity” folder, checking their dimensions, background color, and the subject’s pose.
Currently, agents struggle here. They might find a photo of the user, but fail to verify if the background is truly white or if the dimensions are correct because their “perception”—the ability to connect a written rule to a visual attribute—is still brittle.
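The rule-to-attribute check described above can be sketched in a few lines. This is a hypothetical illustration, not the benchmark’s code: the rule values and photo attributes are invented, and the attribute dicts stand in for an upstream perception step (a vision model or image library reading the actual JPEGs).

```python
# Hypothetical sketch: checking extracted photo attributes against the
# rules an agent read from "Photograph Standard.pdf". All values invented.
RULES = {"width_mm": 45, "height_mm": 45, "background": "white", "pose": "frontal"}

def meets_visa_rules(photo: dict) -> bool:
    """Return True only if every written rule is verified against the photo."""
    return all(photo.get(key) == value for key, value in RULES.items())

photos = [
    {"file": "Identity/passport_old.jpg", "width_mm": 35, "height_mm": 45,
     "background": "white", "pose": "frontal"},
    {"file": "Identity/visa_2024.jpg", "width_mm": 45, "height_mm": 45,
     "background": "white", "pose": "frontal"},
]

matches = [p["file"] for p in photos if meets_visa_rules(p)]
print(matches)  # only the photo satisfying every constraint survives
```

The hard part for current agents is not this comparison logic but producing reliable attribute values in the first place; the benchmark suggests that perception step is where the chain breaks.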
The “Wednesday Routine” Problem
The second task, “profiling,” is even harder. It requires the AI to synthesize patterns over time. For instance, if you ask, “What are my Wednesdays usually like?” the AI has to look across weeks of data.
It might find a calendar event for a “Legal Consultation” every Wednesday morning, a voice memo from a Wednesday afternoon where the user complains about a “long lab session,” and an email receipt for a “reward dinner” at a yakiniku restaurant every Wednesday night. By connecting these dots, the AI builds a profile of your life.
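The dot-connecting above amounts to grouping timestamped evidence by weekday and keeping only what recurs. Here is a minimal sketch under invented data, with each record standing in for a parsed calendar entry, transcribed voice memo, or email receipt:

```python
# Hypothetical sketch: profiling a recurring weekday pattern from
# timestamped evidence across file types. Records are invented.
from collections import defaultdict
from datetime import date

records = [
    (date(2024, 3, 6),  "calendar",   "Legal Consultation"),
    (date(2024, 3, 6),  "voice_memo", "long lab session"),
    (date(2024, 3, 6),  "email",      "yakiniku reward dinner receipt"),
    (date(2024, 3, 13), "calendar",   "Legal Consultation"),
    (date(2024, 3, 13), "email",      "yakiniku reward dinner receipt"),
]

by_weekday = defaultdict(list)
for day, source, note in records:
    by_weekday[day.strftime("%A")].append(note)

# A profile is a pattern that recurs across weeks, not a one-off event.
wednesday_profile = sorted({n for n in by_weekday["Wednesday"]
                            if by_weekday["Wednesday"].count(n) > 1})
print(wednesday_profile)
```

The aggregation is trivial; what the benchmark stresses is everything before it, since extracting comparable “notes” from an .ics file, an .mp3 memo, and an .eml receipt is exactly where multimodal perception falters.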
The researchers found a hilarious, if frustrating, failure mode here called “entity misattribution.” In one test, an AI looked at a user’s health logs and mistakenly concluded that the user was a cat named Shadow, attributing the cat’s vet visits and grooming schedule to the human owner.
Why Current AI Fails
Even the most powerful models, including ChatGPT’s Agent Mode, achieved only 48.3% accuracy on the profiling tasks. The study identified three primary bottlenecks:
- Perception: Models struggle to “see” and “hear” across different file formats consistently.
- Grounding: AI often hallucinates file paths or ignores the specific evidence in your folders in favor of “generic” advice.
- Reasoning: Agents often fail to perform a final “verification” step to ensure their answer is actually backed up by the files they found.
The Path Forward
HippoCamp isn’t just a critique; it’s a roadmap. The researchers argue that the next generation of AI assistants needs “structure-aware search” that understands how folders are organized and a “verification loop” to double-check their work against your actual data. Until then, you might want to keep organizing your own folders—or at least make sure your AI doesn’t think you’re the cat.
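The “verification loop” the researchers call for can be pictured as a final pass in which the agent re-checks each claim against the file it cites. A minimal sketch, with a dict faking file contents (a real agent would read from disk) and an invented hallucinated path:

```python
# Hypothetical sketch of a verification loop: before answering, the agent
# confirms each cited path exists and actually contains the claimed evidence.
files = {
    "Calendar/march.ics": "Wed 09:00 Legal Consultation",
    "Memos/wed_afternoon.txt": "complained about a long lab session",
}

claims = [
    ("Calendar/march.ics", "Legal Consultation"),
    ("Memos/imaginary.txt", "reward dinner"),  # hallucinated file path
]

def verify(claim: tuple) -> bool:
    path, evidence = claim
    return path in files and evidence in files[path]

grounded = [c for c in claims if verify(c)]
print(grounded)  # the hallucinated citation is dropped before answering
```

This directly targets the grounding failure above: an answer is only emitted if every supporting citation survives the check.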