New Benchmark Uncovers "Semantic Aggregation Hallucination" in Long Videos
Researchers have introduced ELV-Halluc, the first benchmark specifically designed to identify and address a subtle yet significant type of error in video understanding models: “Semantic Aggregation Hallucination” (SAH). The benchmark aims to improve the reliability of AI systems by exposing how these models misattribute information across longer video sequences.
Video multimodal large language models (Video-MLLMs) have made impressive strides in understanding video content. However, they often suffer from “hallucinations,” generating descriptions that deviate from, or are entirely unsupported by, the visual evidence. While previous research has focused on hallucinations in short videos, often attributing them to issues like poor image quality or language biases, a new study highlights a more complex problem that arises in longer videos.
The paper, titled “ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding,” introduces Semantic Aggregation Hallucination (SAH). SAH occurs when a model correctly perceives the individual frames or short segments of a video but incorrectly combines or aggregates this information to form a larger event-level understanding. This is particularly problematic in long videos that contain multiple, distinct events.
“Imagine watching a cooking show where a chef prepares a dish,” explains lead author Hao Lu. “The model might correctly identify each ingredient and step. But with SAH, it could mistakenly attribute an ingredient used in the dessert to the main course, or vice versa, even though it saw both correctly at different times.”
The researchers have developed ELV-Halluc, a comprehensive benchmark that includes 8,000 adversarial data pairs. This benchmark is designed to systematically evaluate SAH by presenting models with videos that have clearly separated events. ELV-Halluc categorizes hallucinations across four aspects: visual details, actions, objects, and declarative content.
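To make the setup concrete, here is a minimal sketch of what one adversarial pair might look like. The field names and the example video are illustrative assumptions, not the benchmark's actual schema: the key idea is that both captions describe things the model really saw, but one of them aggregates those details into the wrong event.

```python
# Hypothetical illustration of an ELV-Halluc-style adversarial pair.
# Field names and the example video are assumptions for illustration only;
# they do not reflect the benchmark's actual data format.
adversarial_pair = {
    "video_id": "cooking_show_042",      # long video containing two distinct events
    "aspect": "objects",                 # one of the four hallucination aspects
    "faithful_caption": (
        "In the first segment the chef adds basil to the pasta; "
        "in the second segment she folds strawberries into the dessert."
    ),
    "misattributed_caption": (
        # Every object was genuinely seen, but the event-level aggregation is wrong:
        "In the first segment the chef adds strawberries to the pasta; "
        "in the second segment she folds basil into the dessert."
    ),
}
```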
Experiments conducted with ELV-Halluc revealed that SAH is indeed a significant issue, and that it becomes more pronounced as a video's semantic complexity increases, for example when the video contains more distinct events. The benchmark also showed that models are more prone to SAH when dealing with rapidly changing visual details or actions.
To address this challenge, the researchers explored mitigation strategies. They found that improving positional encoding strategies, which help models understand the temporal relationships between different parts of a video, can help reduce SAH. Furthermore, they employed a technique called Direct Preference Optimization (DPO) to train models to better distinguish between correct event-level semantics and misattributed information.
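The summary does not detail the paper's exact training setup, but the standard DPO objective it builds on can be sketched in a few lines of PyTorch. The variable names, the β value, and the toy log-probabilities below are assumptions for illustration; here, the "chosen" caption would be the one with correct event-level attribution and the "rejected" one a misattributed variant.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard Direct Preference Optimization loss on sequence log-probabilities.

    'chosen'   = caption with correct event-level attribution
    'rejected' = caption where a detail is misattributed across events
    """
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    # Push the policy to prefer correctly aggregated captions more strongly
    # than the frozen reference model does.
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()

# Toy usage with made-up log-probabilities for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.3, -9.8]), torch.tensor([-13.1, -10.5]),
                torch.tensor([-12.7, -10.0]), torch.tensor([-12.9, -10.2]))
print(loss.item())
```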
By combining these strategies and using the ELV-Halluc benchmark, the team achieved a substantial 27.7% reduction in SAH ratio on their test set. The new benchmark and evaluation code are publicly available to encourage further research in this critical area of AI video understanding. This work represents a significant step towards creating more reliable and trustworthy AI systems that can accurately comprehend complex, real-world video content.