Enhancing Document Layout Analysis with DocLayout-YOLO
đź“„ Full Paper
đź’¬ Ask
Document layout analysis (DLA) is a crucial component of many document understanding systems, but there’s a constant tension between speed and accuracy. Multimodal methods, which leverage both text and visual features, achieve higher accuracy but are computationally expensive. Unimodal methods, which rely solely on visual features, are faster but less accurate.
A new paper from the Shanghai Artificial Intelligence Laboratory proposes a novel approach called DocLayout-YOLO that attempts to bridge this gap. DocLayout-YOLO achieves high accuracy while maintaining speed by incorporating two key innovations:
1. DocSynth-300K: A massive, diverse synthetic dataset for document layout analysis.
- This dataset addresses the limitations of existing datasets, which are often limited to single document types.
- DocSynth-300K is synthesized using a novel algorithm called “Mesh-candidate BestFit” that leverages principles from the two-dimensional bin-packing problem to generate diverse layouts.
2. Global-to-Local Controllable Receptive Module (GL-CRM): This module allows DocLayout-YOLO to effectively handle multi-scale variations within documents.
- The module extracts features at different scales and granularities, enabling the model to learn to fuse these features autonomously.
- This hierarchical approach, which goes from global context (whole-page scale) to local semantics, helps improve accuracy.
For example, imagine a document with both a small, single-line title and a large, full-page table. DocLayout-YOLO, with the help of GL-CRM, can effectively detect both elements despite their drastically different scales.
The researchers evaluated DocLayout-YOLO on several downstream datasets, including D4LA, DocLayNet, and their own benchmark called DocStructBench, which includes complex documents from various real-world scenarios.
The results show that DocLayout-YOLO significantly outperforms existing state-of-the-art methods in both speed and accuracy:
- It achieves an mAP (mean Average Precision) of 78.8% on DocStructBench.
- It achieves an inference speed of 85.5 frames per second (FPS).
This advancement has the potential to improve the efficiency of a wide range of document understanding applications, including document parsing, information extraction, and retrieval-augmented generation.
DocLayout-YOLO’s impressive performance highlights the power of synthetic data and specialized model architectures for addressing complex problems in document layout analysis. The code, data, and models are available at https://github.com/opendatalab/DocLayout-YOLO.
Chat about this paper
To chat about this paper, you'll need a free Gemini API key from Google AI Studio.
Your API key will be stored securely in your browser's local storage.