Beyond Text Snippets: New "Chain of Evidence" AI Pinpoints Facts Directly on Document Screenshots

If you have ever asked an AI a complex question, you have likely encountered the “verification bottleneck.” The AI provides an answer and cites a 50-page PDF as its source. To ensure the AI isn’t hallucinating, you are forced to hunt through those pages to find the specific sentence or chart that supports the claim.

Researchers at Peking University and City University of Hong Kong have proposed a way to eliminate this tedious step. In a new paper, they introduce Chain of Evidence (CoE), a framework that allows AI to “see” documents as humans do—as visual screenshots—and point to exactly where it found each piece of information using precise bounding boxes.

The Limits of “Reading” Text

Most current AI systems use a process called Retrieval-Augmented Generation (RAG). To answer a question, the AI searches a database, pulls out relevant text, and summarizes it. However, this method has a major flaw: it usually relies on “parsing,” where visually rich documents like PowerPoint slides, financial reports, or academic papers are stripped of their layout and converted into plain, linear text.
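
To make that lossiness concrete, here is a minimal, self-contained sketch of the parse-then-retrieve step in Python. The toy `embed` and `retrieve` helpers are illustrative stand-ins, not the paper's code; the point is that once a document is flattened into text chunks, the retriever can only ever see a chart's caption, never the chart itself.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for a real embedding model: hash words into a unit vector."""
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank flattened text chunks by cosine similarity to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: -float(embed(c) @ q))[:k]

# The parsing step has already flattened the document: the bar chart's
# geometry is gone, and only its caption text survives as a chunk.
chunks = [
    "Figure 3: Quarterly revenue. (bar chart)",
    "The company was founded in 2004 in Palo Alto.",
    "Appendix B lists board members and their terms.",
]
print(retrieve("What was the revenue trend across quarters?", chunks))
```

The caption is retrieved, but the trend the question asks about lived in the heights of the bars, which the plain-text pipeline discarded before retrieval even began.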

When you convert a complex flowchart or a bar graph into plain text, the “visual semantics”—the logic of the arrows or the height of the bars—are often lost. As the researchers note, if an AI is asked about a trend in a chart, it might find the numbers but fail to understand the relationship between them because the “evidence” exists in the visual layout, not just the words.

How Chain of Evidence Works

CoE changes the paradigm by using Vision-Language Models (VLMs). Instead of reading text files, the AI “looks” at screenshots of document pages. This allows the system to perform what the researchers call “pixel-level visual attribution.”
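
What does a pixel-level attribution look like as data? A plausible minimal record, with illustrative field names rather than the paper's actual schema, pairs a page screenshot with a bounding box in pixel coordinates:

```python
from dataclasses import dataclass

@dataclass
class VisualEvidence:
    """One link in the evidence chain. Field names are illustrative, not the
    paper's schema; the point is that the 'citation' is a region of pixels on
    a rendered page, not a quoted text snippet."""
    page_image: str                    # path or ID of the page screenshot
    bbox: tuple[int, int, int, int]    # (x_min, y_min, x_max, y_max) in pixels
    claim: str                         # the fact this region supports

evidence = VisualEvidence(
    page_image="report_page_7.png",
    bbox=(120, 540, 480, 610),
    claim="Q3 revenue grew 12% year over year",
)
```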

To understand how this builds a reasoning chain, imagine asking the AI: “Which university did the director of the film Inception attend?”

A traditional system might struggle if it has to hop between documents. CoE, however, performs an iterative process:

  1. Hop 1: It retrieves a document about the movie Inception, identifies Christopher Nolan as the director, and draws a red box around his name on the page.
  2. Hop 2: Using that name, it retrieves a biography of Nolan, finds the section on his education, and draws another box around “University College London.”

The result isn’t just a text answer; it is a visual audit trail. The user sees a sequence of images with highlighted regions, making verification near-instant.
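
In code, that iterative process might look like the sketch below. Here `retrieve_page` and `vlm_locate` are hypothetical stand-ins for a screenshot retriever and a VLM call that returns a grounded fact plus a pixel bounding box, and the loop reuses the `VisualEvidence` record sketched earlier; the paper's actual interfaces may differ.

```python
def chain_of_evidence(question: str, retrieve_page, vlm_locate, max_hops: int = 3):
    """Iteratively retrieve a page screenshot, ask the VLM to ground the next
    fact on it, and accumulate a visual audit trail."""
    trail, query = [], question
    for _ in range(max_hops):
        page = retrieve_page(query)          # hop: fetch the most relevant page image
        finding = vlm_locate(page, query)    # VLM reads pixels, returns fact + box
        trail.append(VisualEvidence(page, finding["bbox"], finding["fact"]))
        if finding["is_final"]:              # question fully answered; stop hopping
            return finding["fact"], trail
        query = finding["next_query"]        # e.g. "Christopher Nolan education"
    return None, trail                       # hop budget exhausted without an answer
```

Each iteration appends one boxed region to the trail, so the final answer arrives together with the sequence of highlighted screenshots the user can check at a glance.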

The researchers tested CoE against two challenging benchmarks. The first, Wiki-CoE, features over 70,000 questions based on Wikipedia’s structured layouts. The second, SlideVQA, is even more difficult, consisting of presentation slides filled with non-linear layouts, diagrams, and charts.

In scenarios where information was buried in complex diagrams, traditional text-based AI models saw their performance plummet. Because text parsers ignore “visual connectors” like arrows in a flowchart, they couldn’t follow the logic. CoE, by contrast, maintained robust performance. In their experiments, a fine-tuned version of the Qwen3-VL-8B model achieved 80.4% accuracy in localizing evidence, significantly outperforming much larger models that lack this visual-first training.
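
The digest does not spell out how "accuracy in localizing evidence" is scored, but bounding-box predictions are commonly judged by intersection-over-union (IoU) against annotated gold boxes above a threshold. A minimal sketch of such a metric, as an assumption rather than the paper's exact protocol:

```python
def iou(a, b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])

    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def localization_accuracy(pred_boxes, gold_boxes, thresh=0.5):
    """Fraction of predicted evidence boxes that sufficiently overlap the gold box."""
    hits = sum(iou(p, g) >= thresh for p, g in zip(pred_boxes, gold_boxes))
    return hits / len(gold_boxes)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ~0.143: modest overlap, below a 0.5 cutoff
```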

Why It Matters

As AI is increasingly used in high-stakes fields like law, medicine, and finance, the ability to “show your work” is becoming as important as the answer itself. By moving away from brittle text conversion and toward a visual-first “Chain of Evidence,” this framework provides a blueprint for AI that is not only smarter but significantly more transparent.