AI Papers Reader

Personalized digests of the latest AI research


The "Sandbox" Test: New Benchmark Reveals Why AI Research Agents Still Hallucinate

The next frontier of artificial intelligence isn’t just answering questions; it is conducting “deep research.” These autonomous systems, known as Deep Research Agents (DRAs), are designed to plan investigations, sift through massive amounts of data, and write citation-heavy reports. However, a new study from researchers at Nanjing University and other institutions suggests that even the world’s most advanced AI models still struggle when faced with the messy, “noisy” reality of genuine research.

The paper, titled DR³-Eval, introduces a new benchmark designed to solve a fundamental problem in AI development: the “moving target” of the internet. Most current AI tests either use a “clean” set of documents or allow the AI to browse the live web. The former is too easy, while the latter is impossible to replicate because the web changes every minute.

The Digital Time Capsule

To bridge this gap, the researchers created a “static research sandbox.” Think of it as a digital time capsule that simulates the chaos of the open internet. For each task, the AI is dropped into a controlled environment containing up to 500 web pages. Some of these pages are “signals” (the right answer), but many more are “distractors” (outdated or one-sided info) and “noise” (irrelevant filler).
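To make the idea concrete, here is a minimal sketch of how such a fixed sandbox might be assembled. The category names ("signal", "distractor", "noise") and the 500-page cap come from the article; the function name, proportions, and fixed-seed shuffle are our illustrative assumptions, not the paper's actual code.

```python
import random

def build_sandbox(signal_pages, distractor_pages, noise_pages, max_pages=500):
    """Combine page pools into one fixed, shuffled corpus for an agent run."""
    corpus = (
        [{"role": "signal", "page": p} for p in signal_pages]
        + [{"role": "distractor", "page": p} for p in distractor_pages]
        + [{"role": "noise", "page": p} for p in noise_pages]
    )
    if len(corpus) > max_pages:
        raise ValueError(f"sandbox exceeds {max_pages} pages")
    random.seed(42)  # fixed seed: the corpus never changes between runs
    random.shuffle(corpus)
    return corpus

sandbox = build_sandbox(
    signal_pages=[f"signal_{i}.html" for i in range(20)],
    distractor_pages=[f"distractor_{i}.html" for i in range(80)],
    noise_pages=[f"noise_{i}.html" for i in range(400)],
)
print(len(sandbox))  # 500
```

The point of the fixed seed is reproducibility: unlike the live web, every agent sees exactly the same haystack, so scores are comparable across models and across time.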

The benchmark is grounded in “multimodal” reality. In a typical DR³-Eval task, an agent might be given a set of files that a human researcher would actually use: a high-resolution JPG of a railway map, a 3-minute MP4 video of a technical presentation, and a complex Excel spreadsheet.

For example, a task might ask the agent to “summarize the evolution of China’s high-speed rail compared to Japan’s Shinkansen.” To succeed, the AI cannot simply rely on its internal memory. It must “watch” the provided video to catch specific technical milestones, “read” the map to understand geographic density, and filter through hundreds of sandbox web pages to find credible citations while ignoring “noisy” blog posts about unrelated travel tips.

The “Noise” Problem

When the researchers tested state-of-the-art models like Claude 4 and Gemini 2.5 Pro, the results were humbling. One of the most striking findings was that longer context—the ability for an AI to “read” more pages at once—actually led to worse performance. As the sandbox grew from 64,000 to 512,000 tokens of information, the agents became overwhelmed by the noise, leading to a “general drop in performance” across all models.

The study identified a critical failure mode: “hallucination control.” Even when models followed instructions perfectly and looked at the right documents, they frequently fabricated facts or failed to cite their sources correctly. In one analysis, hallucination remained the primary cause of failure, suggesting that while AI is getting better at finding information, it still struggles to use it without making things up.
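One way to catch this failure mode mechanically is to verify that every citation in a report points at a page that actually exists in the sandbox and really contains the quoted text. The sketch below is our illustration of that idea, not the paper's grader; the data layout and function name are assumptions.

```python
def check_citations(report_citations, sandbox):
    """report_citations: list of (page_id, quoted_snippet) pairs.

    Returns a list of (page_id, reason) failures; empty means fully grounded.
    """
    pages = {p["id"]: p["text"] for p in sandbox}
    failures = []
    for page_id, snippet in report_citations:
        if page_id not in pages:
            failures.append((page_id, "cited page not in sandbox"))
        elif snippet not in pages[page_id]:
            failures.append((page_id, "quoted text not found on page"))
    return failures

sandbox = [
    {"id": "signal_3.html", "text": "The Tokaido Shinkansen opened in 1964."},
]
print(check_citations([("signal_3.html", "opened in 1964")], sandbox))  # []
print(check_citations([("made_up.html", "anything")], sandbox))
```

Because the sandbox is static, this check is exact: a fabricated source or misquoted snippet cannot hide behind a web page that has since changed.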

Why It Matters

For the tech industry, DR³-Eval provides a much-needed reality check. It moves evaluation away from “vibe-based” checking toward a verifiable, “reverse-constructed” methodology where every question has a definitive, grounded solution path.
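A hedged sketch of what "reverse-constructed" grading can look like: because each task is built backwards from a known answer, the grader can score a report against an explicit rubric of required facts and required evidence pages. The field names and scoring scheme below are our assumptions, not the paper's schema.

```python
RUBRIC = {
    "required_facts": ["1964", "Tokaido"],       # facts the answer must state
    "required_pages": {"signal_3.html"},          # evidence that must be cited
}

def grade(report_text, cited_pages, rubric):
    """Score a report on factual recall and evidence grounding."""
    fact_hits = sum(f in report_text for f in rubric["required_facts"])
    page_ok = rubric["required_pages"] <= set(cited_pages)
    return {
        "fact_recall": fact_hits / len(rubric["required_facts"]),
        "evidence_grounded": page_ok,
    }

result = grade(
    "The Tokaido Shinkansen opened in 1964.",
    ["signal_3.html", "signal_7.html"],
    RUBRIC,
)
print(result)  # {'fact_recall': 1.0, 'evidence_grounded': True}
```

The appeal of this design is that a grade no longer depends on a judge's "vibe": a report either states the required facts and cites the planted evidence, or it does not.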

As we move toward a world where AI agents handle our market reports and scientific literature reviews, the DR³-Eval team warns that “looks” can be deceiving. A report that looks professional and follows every formatting rule can still be factually hollow—a gap that this new benchmark is determined to close.