ViDiC: New AI Benchmark Forces Multimodal Models to Reason About Dynamic Video Edits

In a significant step toward making AI models true video experts, researchers have introduced the Video Difference Captioning (ViDiC) task and the accompanying ViDiC-1K benchmark. The task challenges Multimodal Large Language Models (MLLMs) to do more than describe what is happening in a single video: they must precisely articulate the similarities and differences between two dynamic clips, identifying everything from subtle motion changes to sophisticated camera work.

Until now, vision-language models have largely been tested on Image Difference Captioning (IDC), which compares static snapshots and is therefore blind to the temporal dynamics essential for understanding real-world video editing and forensics. ViDiC bridges this gap by requiring models to generate natural-language descriptions that capture changes in composition, spatial arrangement, and temporal flow across video pairs.

The ViDiC-1K dataset comprises 1,000 curated video pairs, encompassing both real-world footage and synthetically generated edits, allowing for precise control over variations. The differences are categorized across seven key dimensions: Subject, Style, Background, Camera Work, Motion, Position, and Playback Technique.
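To make the setup concrete, here is a minimal sketch of how one such annotated pair might be represented; the class and field names are illustrative assumptions, not the authors' actual schema.

```python
from dataclasses import dataclass, field
from typing import List

# The seven difference dimensions described in the benchmark.
DIFFERENCE_DIMENSIONS = [
    "Subject", "Style", "Background", "Camera Work",
    "Motion", "Position", "Playback Technique",
]

@dataclass
class VideoPair:
    """One benchmark item: two clips plus annotated similarities and differences."""
    video_a_path: str
    video_b_path: str
    source: str                       # e.g. "real" or "synthetic" (assumed labels)
    dimensions: List[str]             # subset of DIFFERENCE_DIMENSIONS that differ
    similarity_notes: List[str] = field(default_factory=list)
    difference_notes: List[str] = field(default_factory=list)

example = VideoPair(
    video_a_path="clips/pair_0001_a.mp4",
    video_b_path="clips/pair_0001_b.mp4",
    source="synthetic",
    dimensions=["Playback Technique"],
    similarity_notes=["Same subject, background, and framing."],
    difference_notes=["Video B is played back at half speed."],
)
```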

The Difficulty of Dynamics

The true difficulty of ViDiC lies in its fine-grained temporal challenges. For instance, models must distinguish between two seemingly identical videos where the difference is a Playback Technique edit (e.g., Video A plays normally, while Video B is slightly slowed down or reversed) or a minute Camera Work variation (Video A is static, while Video B performs a slow zoom-in).
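For a sense of how such playback edits can be produced, here is a small OpenCV sketch that generates a reversed or half-speed variant of a clip; the function name, modes, and codec choice are assumptions, not the benchmark's actual generation pipeline.

```python
import cv2

def make_playback_variant(src_path: str, dst_path: str, mode: str = "reverse") -> None:
    """Create a playback-technique edit of a clip: reversed or half-speed (illustrative sketch)."""
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()
    if not frames:
        raise ValueError(f"could not read any frames from {src_path}")

    if mode == "reverse":
        frames = frames[::-1]          # identical content, reversed temporal order
        out_fps = fps
    elif mode == "slow":
        out_fps = fps / 2.0            # same frames, half playback speed
    else:
        raise ValueError(f"unknown mode: {mode}")

    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"), out_fps, (w, h))
    for f in frames:
        writer.write(f)
    writer.release()
```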

To ensure rigorous evaluation beyond traditional text-matching scores like BLEU or CIDEr, the researchers developed a novel Dual-Checklist framework paired with an “LLM-as-a-Judge” protocol (using a model like GPT-5 Mini). The human-validated checklist converts comparison data into hundreds of binary (yes/no) questions, allowing the judge model to quantify the factual accuracy of the generated captions without needing access to the video pixels themselves.
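In code, the checklist protocol amounts to asking the judge model one yes/no question at a time about a generated caption and counting agreements. The sketch below assumes a generic `ask_judge` callable wrapping whatever judge LLM is used; the prompt wording and aggregation are illustrative, not the paper's exact implementation.

```python
from typing import Callable, Dict, List

def score_caption_with_checklist(
    caption: str,
    questions: List[str],
    ask_judge: Callable[[str], str],
) -> Dict[str, float]:
    """Score a generated difference caption against binary checklist questions.

    `ask_judge` wraps the judge LLM (e.g. an LLM-as-a-Judge such as GPT-5 Mini)
    and is expected to return "yes" or "no".
    """
    answers = []
    for q in questions:
        prompt = (
            "You are verifying a video-difference caption against a checklist.\n"
            f"Caption: {caption}\n"
            f"Question: {q}\n"
            "Answer strictly 'yes' or 'no'."
        )
        answers.append(ask_judge(prompt).strip().lower().startswith("y"))
    return {"accuracy": sum(answers) / len(answers) if questions else 0.0}
```

Note that the judge only sees the caption and the human-validated questions, which is what allows evaluation without access to the video pixels themselves.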

Performance Gaps and Trade-offs

Initial testing on nineteen representative MLLMs revealed significant limitations, particularly in understanding technical video manipulations. Models struggled most with the Camera Work and Playback Technique categories, indicating a fundamental weakness in temporal artifact identification.

The evaluation also exposed a critical “Similarity and Difference Trade-off.” Models often demonstrated high Similarity scores—meaning they accurately identified what hadn’t changed—but scored poorly on Difference, reflecting weak fine-grained perception. For example, proprietary models like GPT-4o achieved high similarity (81.12%) but a low difference score (39.14%). This suggests that while MLLMs can capture coarse distinctions, they often miss subtle details or, conversely, hallucinate non-existent differences when pushed to be highly descriptive.
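A natural way to expose this trade-off is to keep the similarity and difference checklists separate and report one percentage per axis, as in the sketch below; the simple averaging is an assumption, not necessarily the paper's exact metric.

```python
from typing import Dict, List

def dual_checklist_scores(
    similarity_answers: List[bool],
    difference_answers: List[bool],
) -> Dict[str, float]:
    """Aggregate judge verdicts into separate Similarity and Difference scores."""
    def pct(xs: List[bool]) -> float:
        return 100.0 * sum(xs) / len(xs) if xs else 0.0

    return {
        "similarity": pct(similarity_answers),   # high when unchanged content is described correctly
        "difference": pct(difference_answers),   # low when fine-grained edits are missed
    }
```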

This vulnerability was particularly evident when researchers applied visual corruptions (blur, noise) to the videos. Paradoxically, the Similarity score increased, as the noise acted as a regularizer, preventing models from hallucinating minor artifacts. However, the Difference score consistently dropped because the same interference masked genuine, subtle distinctions.
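A corruption pipeline of this kind can be sketched in a few lines with OpenCV and NumPy; the kernel size and noise level below are illustrative assumptions rather than the paper's settings.

```python
import cv2
import numpy as np

def corrupt_frame(frame: np.ndarray, blur_ksize: int = 9, noise_std: float = 10.0) -> np.ndarray:
    """Apply Gaussian blur plus additive Gaussian noise to a single frame (illustrative values)."""
    blurred = cv2.GaussianBlur(frame, (blur_ksize, blur_ksize), 0)
    noise = np.random.normal(0.0, noise_std, frame.shape)
    noisy = np.clip(blurred.astype(np.float32) + noise, 0, 255)
    return noisy.astype(np.uint8)
```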

The ViDiC benchmark is set to become a foundation for developing the next generation of multimodal AI, providing a roadmap for researchers to improve model performance in areas critical to explainable video editing, deepfake detection, and comprehensive video reasoning.