Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios That Are Difficult for Humans?

MME-RealWorld, a new benchmark dataset for evaluating multimodal large language models (MLLMs), has been released to address common limitations of existing benchmarks in data scale, annotation quality, and task difficulty.

The dataset features high-resolution images and carefully crafted questions that pose real-world scenarios challenging even for humans. It is the largest manually annotated benchmark known to date, containing over 29,000 question-answer pairs and covering 43 sub-tasks across 5 real-world domains.
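To make the dataset's composition more concrete, here is a minimal sketch of what a single annotation record and its prompt might look like. The field names and the example question are hypothetical assumptions for illustration, not the released data format, which is documented on the project page.

```python
import json

# Hypothetical annotation record. The field names below are illustrative
# assumptions, NOT the official MME-RealWorld schema; see the project page
# for the actual data format.
record_json = """
{
  "question_id": "ocr_wild_000123",
  "image": "images/ocr_wild/000123.jpg",
  "question": "What is the license plate number of the white car?",
  "options": ["(A) ABC-1234", "(B) XYZ-5678", "(C) DEF-9012", "(D) GHI-3456", "(E) No white car is visible."],
  "answer": "A",
  "domain": "OCR in the wild",
  "subtask": "license plate recognition"
}
"""

record = json.loads(record_json)

# Build a multiple-choice prompt that could be sent to an MLLM together with
# the high-resolution image referenced by record["image"].
prompt = record["question"] + "\n" + "\n".join(record["options"])
print(prompt)
print("Ground-truth answer:", record["answer"])
```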

MME-RealWorld also includes a Chinese-language version, MME-RealWorld-CN, with images and questions focused on Chinese scenarios. This addresses a limitation of existing Chinese benchmarks, whose questions are often translated from English, which can lead to misalignment between the question and the image.

The authors evaluated 28 prominent MLLMs on the benchmark, including GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet. Their results show that even the most advanced models fail to reach 60% accuracy, highlighting the difficulty of perceiving high-resolution images and understanding complex real-world scenarios. This underscores the need for further research and development in the field of MLLMs.
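As a rough illustration of how accuracy on such a multiple-choice benchmark can be computed, the sketch below scores predictions overall and per domain. It is a generic example under assumed data structures, not the authors' released evaluation code.

```python
from collections import defaultdict

def accuracy_by_domain(predictions, ground_truth):
    """Compute overall and per-domain accuracy for multiple-choice answers.

    predictions:  dict mapping question_id -> predicted option letter, e.g. "A"
    ground_truth: dict mapping question_id -> (correct option letter, domain)
    Both structures are illustrative assumptions, not the official format.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for qid, (answer, domain) in ground_truth.items():
        total[domain] += 1
        total["overall"] += 1
        # Treat a prediction as correct if it starts with the right option letter.
        if predictions.get(qid, "").strip().upper().startswith(answer):
            correct[domain] += 1
            correct["overall"] += 1
    return {d: correct[d] / total[d] for d in total}

# Toy usage with two questions from two hypothetical domains.
gt = {"q1": ("A", "OCR in the wild"), "q2": ("C", "Remote Sensing")}
pred = {"q1": "A", "q2": "B"}
print(accuracy_by_domain(pred, gt))
# e.g. {'OCR in the wild': 1.0, 'overall': 0.5, 'Remote Sensing': 0.0}
```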

The data and evaluation code are available on the project page: https://mme-realworld.github.io/