Vision-Based Retrieval-Augmented Generation on Multi-Modality Documents

🔊

💬 Ask

Large language models (LLMs) are impressive, but they struggle with hallucinations and updating their knowledge. Retrieval-augmented generation (RAG) addresses these issues by allowing LLMs to access external knowledge sources. Existing RAG systems work solely on text, ignoring crucial visual information in real-world documents. This limits their ability to accurately answer questions based on complex multi-modal documents, like textbooks or manuals.

To overcome this limitation, a new approach called VisRAG has been introduced. VisRAG utilizes vision-language models (VLMs) to directly process visual information like images and layouts present in documents, thereby retaining all data information. This differs from traditional RAG systems, which rely on parsed text, potentially losing information in the process.

How VisRAG Works

VisRAG uses a VLM-based retriever (VisRAG-Ret) and generator (VisRAG-Gen) to handle multi-modal documents. The retriever takes a document image as input instead of text, embedding it using a VLM to find relevant documents. The generator then leverages the retrieved image to generate an answer.

To address the challenge of VLMs that can only accept a single image, VisRAG offers two techniques:

Page concatenation: Combines multiple retrieved pages into a single image for input to VLMs that accept only one image at a time.
Weighted selection: Queries the VLM for each page separately and combines their outputs based on their relevance to the query, yielding a final answer.

VisRAG’s Advantages

Experiments show that VisRAG outperforms traditional RAG systems in both retrieval and generation:

Retrieval: VisRAG-Ret surpasses text-based retrieval models and state-of-the-art vision-centric models, achieving better results than relying solely on vision or language encoders.
Generation: VisRAG-Gen outperforms TextRAG generators, particularly with VLMs that can handle multiple images. This indicates VisRAG’s effectiveness in leveraging multi-page information.

Furthermore, VisRAG shows superior training data efficiency compared to baseline models. This suggests VisRAG is robust and generalizes well, making it a promising solution for RAG on multi-modal documents.

Real-World Implications

VisRAG has significant potential to revolutionize RAG applications in diverse fields:

Education: Answering questions from textbooks and manuals containing images and diagrams.
Healthcare: Analyzing medical reports with images and patient records.
Legal: Extracting information from legal documents with images and text.

Overall, VisRAG represents a significant advancement in RAG technology. It offers a more comprehensive approach to utilizing multi-modal information, paving the way for more accurate and efficient knowledge extraction from real-world documents.

AI Papers Reader

Personalized digests of latest AI research

Vision-Based Retrieval-Augmented Generation on Multi-Modality Documents

Chat about this paper