New AI Benchmark Reveals Major Flaw in Multimodal Models: They Cannot Reason Across Multiple Videos
Researchers introduce MVU-Eval, the first multi-video evaluation platform, showing state-of-the-art MLLMs fail complex tasks required for real-world scenarios like autonomous driving and sports analytics.
A new comprehensive benchmark, MVU-Eval, has been unveiled by a team of researchers from institutions including Nanjing University and CASIA, exposing a significant weakness in current Multimodal Large Language Models (MLLMs). While modern MLLMs excel at interpreting single images or videos, the new research demonstrates they severely struggle when asked to understand and reason across multiple, distinct video streams simultaneously.
The researchers argue that existing benchmarks overlook multi-video understanding, a capability essential for real-world applications. For instance, autonomous vehicles must combine inputs from multiple cameras pointed in different directions, and professional sports analysts synthesize footage from several cross-angle feeds.
To address this gap, the researchers built MVU-Eval, which they describe as the first benchmark specifically designed for multi-video understanding. It comprises 1,824 meticulously curated question-answer pairs spanning nearly 5,000 videos from diverse domains, including movies, gaming, sports, and autonomous driving.
MVU-Eval systematically assesses eight core competencies, divided into two protocols: Perception and Reasoning.
Building Cross-Video Intuition
The benchmark forces models to integrate information that cannot be gathered from a single clip.
For Perception tasks, models must perform functions like Spatial Understanding (SU) across multiple complementary camera views—for example, interpreting the precise spatial layout of traffic around a moving car using feeds from different onboard cameras. Another key task, Counting, requires models to aggregate transient objects across asynchronous videos (e.g., determining which of three separate clips contains the most chairs).
For higher-order Reasoning tasks, the complexity increases. Knowledge-Intensive Reasoning (KIR) demands combining visual evidence with specific domain knowledge, such as judging the technical difficulty score of a diving action shown across six separate video angles. In-Context Learning (ICL) challenges the model to infer patterns from previous video-QA pairs and apply that learned logic to a new, unseen video clip.
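To make the In-Context Learning setup concrete, here is a minimal sketch of how solved video-QA pairs might be laid out ahead of an unanswered query. The `VideoQA` class, the `build_icl_prompt` function, the `<video_N: ...>` placeholder syntax, and the file names are all hypothetical, invented for illustration; the article does not describe MVU-Eval's actual prompt format.

```python
from dataclasses import dataclass


@dataclass
class VideoQA:
    """One solved example: a clip, a question about it, and its answer (hypothetical schema)."""
    video_path: str
    question: str
    answer: str


def build_icl_prompt(examples, query_video, query_question):
    """Interleave solved video-QA pairs, then append the unanswered query."""
    parts = []
    for i, ex in enumerate(examples, start=1):
        # The placeholder marks where a real system would inject video frames.
        parts.append(f"<video_{i}: {ex.video_path}>")
        parts.append(f"Q: {ex.question}")
        parts.append(f"A: {ex.answer}")
    query_index = len(examples) + 1
    parts.append(f"<video_{query_index}: {query_video}>")
    parts.append(f"Q: {query_question}")
    parts.append("A:")  # left open for the model to complete
    return "\n".join(parts)


# Illustrative data only; these clips and answers are not from the benchmark.
examples = [
    VideoQA("clip_a.mp4", "Which action is shown?", "B"),
    VideoQA("clip_b.mp4", "Which action is shown?", "D"),
]
prompt = build_icl_prompt(examples, "clip_c.mp4", "Which action is shown?")
print(prompt)
```

The point of the layout is that the model sees the pattern across earlier clips before it reaches the new one, so a correct answer depends on more than the final video alone.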
The Performance Gap
Initial evaluations of 26 state-of-the-art MLLMs—including both open-source and closed-source leaders—show a stark difference between human and machine performance.
The top-performing commercial model, Gemini 2.5 Pro, achieved only 58.4% overall accuracy on MVU-Eval, highlighting the challenging nature of the benchmark and the substantial room for improvement. Most open-source models scored below 50%. Human experts, in comparison, achieved an accuracy of 93.6%.
The analysis also revealed that model capabilities are imbalanced; a model strong in Object Recognition might fail at Spatial Understanding. Consistent with general AI trends, larger models generally exhibited better performance, but scaling alone was often insufficient to solve complex cross-video reasoning hurdles.
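The imbalance described above only becomes visible when accuracy is broken down by task rather than reported as a single number. The following is a minimal sketch of that bookkeeping, assuming each result is a (task, predicted answer, correct answer) triple with multiple-choice letters as answers. The function name, the record layout, and the toy data are assumptions made for illustration, not the benchmark's actual scoring code or results.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Return (per-task accuracy dict, overall accuracy) from (task, predicted, gold) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, gold in records:
        total[task] += 1
        if predicted == gold:
            correct[task] += 1
    per_task = {task: correct[task] / total[task] for task in total}
    n_items = sum(total.values())
    # Guard against an empty result set.
    overall = sum(correct.values()) / n_items if n_items else 0.0
    return per_task, overall


# Toy data only; not real MVU-Eval results.
records = [
    ("OR", "A", "A"),
    ("OR", "B", "B"),
    ("SU", "C", "A"),
    ("SU", "D", "D"),
    ("Counting", "B", "C"),
]
per_task, overall = accuracy_by_task(records)
for task, acc in per_task.items():
    print(f"{task}: {acc:.1%}")
print(f"Overall: {overall:.1%}")
```

In this toy run the hypothetical model is perfect on Object Recognition yet gets only half of the Spatial Understanding items right, a gap that the overall score alone would hide.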
The findings underscore that current MLLMs lack the inter-video fusion and alignment mechanisms needed to aggregate and correlate information from many unaligned video inputs. By releasing this rigorous, publicly available benchmark, the researchers aim to offer a clear roadmap for developing the next generation of truly robust multimodal AI systems.