Vision-Based Retrieval-Augmented Generation on Multi-Modality Documents

Large language models (LLMs) are impressive, but they struggle with hallucinations and updating their knowledge. Retrieval-augmented generation (RAG) addresses these issues by allowing LLMs to access external knowledge sources. Existing RAG systems work solely on text, ignoring crucial visual information in real-world documents. This limits their ability to accurately answer questions based on complex multi-modal documents, like textbooks or manuals.

To overcome this limitation, a new approach called VisRAG has been introduced. VisRAG uses vision-language models (VLMs) to process document pages directly as images, preserving visual elements such as figures, tables, and layout. This differs from traditional RAG systems, which rely on parsed text and can lose that information during parsing.

How VisRAG Works

VisRAG uses a VLM-based retriever (VisRAG-Ret) and a VLM-based generator (VisRAG-Gen) to handle multi-modal documents. Instead of embedding parsed text, the retriever embeds each document page as an image with a VLM and matches these embeddings against the query to find relevant pages. The generator then conditions on the retrieved page images to produce an answer.
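
To make the retrieve-then-generate flow concrete, here is a minimal Python sketch. The embed_page, embed_query, and generate functions are toy placeholders standing in for VLM calls (VisRAG itself builds on an actual vision-language model); only the overall pipeline shape mirrors the description above.

```python
import numpy as np

def embed_page(image: np.ndarray) -> np.ndarray:
    """Placeholder for the VLM image encoder (the role VisRAG-Ret plays)."""
    vec = image.astype(np.float32).mean(axis=(0, 1))  # toy embedding: mean per channel
    return vec / (np.linalg.norm(vec) + 1e-8)

def embed_query(query: str) -> np.ndarray:
    """Placeholder for the VLM text encoder applied to the query."""
    rng = np.random.default_rng(abs(hash(query)) % (2**32))
    vec = rng.standard_normal(3).astype(np.float32)
    return vec / (np.linalg.norm(vec) + 1e-8)

def retrieve(query: str, pages: list[np.ndarray], k: int = 2) -> list[int]:
    """Rank page images by cosine similarity between query and page embeddings."""
    q = embed_query(query)
    scores = [float(q @ embed_page(img)) for img in pages]
    return sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)[:k]

def generate(query: str, retrieved: list[np.ndarray]) -> str:
    """Placeholder for the VLM generator (the role VisRAG-Gen plays)."""
    return f"Answer to {query!r}, grounded in {len(retrieved)} retrieved page image(s)."

# Toy corpus: three synthetic "page images".
pages = [np.random.default_rng(i).integers(0, 256, (64, 64, 3), dtype=np.uint8) for i in range(3)]
query = "What does the diagram on page 2 show?"
top = retrieve(query, pages, k=2)
print(generate(query, [pages[i] for i in top]))
```

In practice the page embeddings would typically be precomputed and stored in a vector index, with only the query embedded at request time.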

To address the challenge that many VLMs accept only a single image as input, VisRAG-Gen offers two techniques for handling multiple retrieved pages:

- Page concatenation: the retrieved page images are stitched together into a single image that is passed to the VLM.
- Weighted selection: the VLM generates a candidate answer from each retrieved page separately, and the final answer is the candidate with the highest generation confidence weighted by its retrieval score.
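
Below is a minimal sketch of these two strategies, assuming a hypothetical single-image VLM wrapped as vlm_answer(image, query) that returns an answer string and a confidence score. The helpers are illustrative placeholders, not the paper's implementation.

```python
import numpy as np

def vlm_answer(image: np.ndarray, query: str) -> tuple[str, float]:
    """Placeholder for a single-image VLM call returning (answer, confidence)."""
    return f"answer for {query!r} from a {image.shape[1]}-pixel-wide page", float(image.mean()) / 255.0

def page_concatenation(images: list[np.ndarray]) -> np.ndarray:
    """Stitch the retrieved pages side by side so a single-image VLM sees them all."""
    height = max(img.shape[0] for img in images)
    padded = [np.pad(img, ((0, height - img.shape[0]), (0, 0), (0, 0))) for img in images]
    return np.concatenate(padded, axis=1)

def weighted_selection(images: list[np.ndarray], retrieval_scores: list[float], query: str) -> str:
    """Answer from each page separately, then keep the candidate whose
    generation confidence times retrieval score is highest."""
    candidates = [vlm_answer(img, query) for img in images]
    weights = [conf * score for (_, conf), score in zip(candidates, retrieval_scores)]
    return candidates[int(np.argmax(weights))][0]

# Usage with two synthetic pages and their retrieval scores.
pages = [np.random.default_rng(i).integers(0, 256, (48, 32, 3), dtype=np.uint8) for i in range(2)]
stitched = page_concatenation(pages)   # one wide image for a single-image VLM
print(stitched.shape)                  # (48, 64, 3)
print(weighted_selection(pages, [0.9, 0.4], "What is shown in the table?"))
```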

VisRAG’s Advantages

Experiments show that VisRAG outperforms traditional text-based RAG pipelines in both the retrieval and generation stages: the VLM-based retriever surpasses text-based retrieval baselines, and the end-to-end pipeline answers questions over multi-modal documents more accurately than its text-only counterpart.

Furthermore, VisRAG shows superior training data efficiency compared to baseline models. This suggests VisRAG is robust and generalizes well, making it a promising solution for RAG on multi-modal documents.

Real-World Implications

VisRAG has significant potential to improve RAG applications in fields that depend on visually rich documents, such as education (textbooks), technical support (product manuals), and enterprise document search, where answers often hinge on figures, tables, and page layout.

Overall, VisRAG represents a significant advancement in RAG technology. It offers a more comprehensive approach to utilizing multi-modal information, paving the way for more accurate and efficient knowledge extraction from real-world documents.