
Beyond the Polished Prose: MiroEval Unveils the "Traceability Gap" in AI Deep Research

The latest generation of AI “Deep Research” agents—systems designed to spend minutes or even hours scouring the web to produce comprehensive reports—can be remarkably persuasive. However, a new study from researchers at MiroMind, the National University of Singapore, and Nanyang Technological University suggests that we have been judging these digital researchers by the wrong criteria. By focusing only on the final, polished report, we often miss a “traceability gap” where AI agents invent conclusions that their own research never actually supported.

To solve this, the team introduced MiroEval, a new benchmarking framework designed to audit not just what an AI produces, but how it conducts its investigation. Where traditional benchmarks rely on fixed rubrics, MiroEval builds adaptive, task-specific rubrics and applies them to 13 leading systems—including OpenAI Deep Research and Gemini-3.1-Pro—across 100 complex tasks grounded in real-world user needs.

Auditing the “Black Box” of Research

MiroEval moves beyond simple text-matching. It evaluates agents along three dimensions: Synthesis Quality (how well the report is written), Factuality (checking claims against live web sources and uploaded files), and Process Quality (the logic of the search path itself).
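The paper's exact scoring formula isn't given in this digest, but the three-dimension structure can be sketched in a few lines. Everything below the dimension names themselves (the weights, the aggregation rule, the `EvalResult` type) is an illustrative assumption, not MiroEval's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    synthesis_quality: float  # how well the report is written, 0-1
    factuality: float         # claims checked against live sources/files, 0-1
    process_quality: float    # soundness of the search trajectory, 0-1

def overall_score(r: EvalResult, weights=(0.3, 0.4, 0.3)) -> float:
    """Combine the three dimensions; the weighting here is a placeholder."""
    ws, wf, wp = weights
    return ws * r.synthesis_quality + wf * r.factuality + wp * r.process_quality

# A polished but poorly grounded report scores lower than its prose suggests:
print(f"{overall_score(EvalResult(0.9, 0.4, 0.5)):.2f}")  # 0.58
```

The point of the structure, whatever the real weights are, is that strong `synthesis_quality` alone cannot carry a report whose factuality and process scores are weak.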

The most significant innovation is the “Process-Centric” evaluation. In deep research, a system might follow a redundant or shallow path but still generate a “plausible-looking” report. MiroEval audits the “search trajectory,” measuring whether the agent actually displayed critical thinking or simply hit a “hallucination wall” when evidence was thin.

The Problem of “Fabricated Fidelity”

The researchers provide concrete examples of why this process-level audit is necessary. In one test case, a user uploaded a screenshot of a “Fastest-Growing Software Vendors” list and asked for an analysis. The image contained names and ranks but no specific financial figures.

While the top-performing model, MiroThinker-H1, correctly identified the limits of the data, a rival agent (ChatGLM) fabricated specific growth rates (e.g., “Kling AI 1900%”) that appeared nowhere in the source. Because the final report looked professional and included specific numbers, a standard evaluation might have graded it highly. MiroEval’s “Grounding” layer caught the fabrication by detecting that the agent was inferring the figures rather than retrieving them.
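The article doesn't describe how the Grounding layer works internally, but the core idea (numeric claims in the report must be attested somewhere in the gathered evidence) can be illustrated with a naive string-matching stand-in. The function name and the matching strategy here are assumptions for illustration only:

```python
import re

def ungrounded_numbers(report: str, evidence: list[str]) -> list[str]:
    """Return numeric claims in the report that appear in no evidence snippet.

    Naive substring matching -- a toy stand-in for a real grounding check,
    which would need normalization, entity linking, and fuzzy matching.
    """
    claims = re.findall(r"\d[\d,.]*%?", report)
    corpus = " ".join(evidence)
    return [c for c in claims if c not in corpus]

report = "Kling AI grew 1900% while Vendor B grew 45%."
evidence = ["Screenshot lists Vendor B rank and 45% growth; no figure for Kling AI."]
print(ungrounded_numbers(report, evidence))  # ['1900%'] -- the fabricated figure
```

Even this crude check separates the two cases in the example: the 45% figure traces back to the source, while the 1900% figure does not.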

In another case involving veterinary nutrition, models were asked to compare cat foods based on photos of cans. Many models “invented” nutrient values for products where the label wasn’t visible in the photo. MiroEval’s adaptive rubric specifically penalized this “uncertainty-governance” failure, rewarding models that admitted when data was missing.

Key Findings: Multimodal is Harder

The study revealed three major insights:

  1. Process Predicts Outcome: A disciplined research process is a reliable predictor of a factual report. Systems that “search wide but fail to go deep” often produce reports that look good but lack analytical substance.
  2. The Multimodal Bottleneck: When tasks required analyzing attachments like PDFs, spreadsheets, or images, performance across nearly all systems dropped by 3 to 10 points.
  3. The Traceability Gap: Many systems introduced major conclusions in their final reports that could not be traced back to their documented search steps.
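The third finding suggests a concrete audit: every major conclusion should share substantive terms with at least one documented search step. The sketch below is a toy proxy for that trace audit; the term-overlap heuristic and its threshold are assumptions, not the paper's method:

```python
def traceability_gap(conclusions: list[str], search_log: list[str],
                     overlap: int = 2) -> list[str]:
    """Flag conclusions whose key terms never co-occur in any logged search step.

    Keyword overlap is a crude proxy; a real audit would compare claims
    against retrieved content, not just query text.
    """
    flagged = []
    for c in conclusions:
        terms = {w.lower().strip(".,") for w in c.split() if len(w) > 4}
        if not any(len(terms & {w.lower().strip(".,") for w in s.split()}) >= overlap
                   for s in search_log):
            flagged.append(c)
    return flagged

log = ["queried vendor revenue figures for 2023"]
claims = ["Vendor revenue doubled during 2023.",
          "Market share exceeds rivals considerably."]
print(traceability_gap(claims, log))  # ['Market share exceeds rivals considerably.']
```

The first claim traces to the logged query; the second shares nothing with any search step, which is exactly the pattern the traceability gap describes.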

Ultimately, MiroThinker-H1 emerged as the most balanced performer, maintaining high factual discipline even in multimodal settings. As AI agents move into high-stakes fields like finance and medicine, MiroEval provides a necessary diagnostic tool to ensure our digital researchers aren’t just good writers, but honest investigators.