Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios That Are Difficult for Humans?
MME-RealWorld, a new benchmark dataset for evaluating multimodal large language models (MLLMs), has been released to address limitations of existing benchmarks.
The dataset features high-resolution images and carefully crafted questions grounded in real-world scenarios that are challenging even for humans. It is the largest manually annotated benchmark known to date, containing over 29,000 question-answer pairs and covering 43 sub-class tasks across 5 domains (a sketch of how individual samples might be structured follows the list):
- Optical Character Recognition in the Wild (OCR): This task assesses a model's ability to perceive and understand text in real-world scenes such as street scenes, shops, posters, books, and competitions. Examples include identifying phone numbers or addresses, recognizing product names, and extracting information from elevation maps or comics.
- Remote Sensing (RS): This task focuses on interpreting high-resolution images captured by satellites. Examples include counting objects such as airplanes or ships, identifying their colors, and determining their spatial relationships.
- Diagram and Table (DT): This task evaluates a model's ability to understand and reason about diagrams and tables containing rich numerical information. Examples include identifying specific values within a table or diagram, performing simple calculations, and predicting trends.
- Autonomous Driving (AD): This task tests a model's understanding of complex driving scenarios, including perception of distant objects and interactions among traffic agents. Examples include predicting the intention of the ego vehicle, understanding interactions between traffic elements, and identifying the traffic signal the driver should attend to.
- Monitoring (MO): This task assesses a model's ability to interpret images from surveillance cameras, including highly complex scenes. Examples include counting pedestrians or cars, recognizing their attributes, and reasoning about their intentions or functions.
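As a rough illustration of how such a benchmark is typically consumed, the sketch below loads question-answer records into a simple Python structure. The field names (`image_path`, `question`, `choices`, `answer`, `domain`, `subtask`), the JSON layout, and the multiple-choice assumption are illustrative guesses, not the benchmark's official schema.

```python
import json
from dataclasses import dataclass


@dataclass
class RealWorldSample:
    """One manually annotated question about a high-resolution image.

    The field names below are illustrative assumptions, not the
    benchmark's official schema.
    """
    image_path: str     # path to the high-resolution image
    question: str       # question text (Chinese for MME-RealWorld-CN)
    choices: list[str]  # candidate answers, assuming a multiple-choice format
    answer: str         # ground-truth option label, e.g. "A"
    domain: str         # one of: OCR, RS, DT, AD, MO
    subtask: str        # one of the 43 sub-class tasks


def load_samples(annotation_file: str) -> list[RealWorldSample]:
    """Load samples from an annotation file, assumed to be a JSON list of dicts."""
    with open(annotation_file, encoding="utf-8") as f:
        return [RealWorldSample(**record) for record in json.load(f)]
```

Grouping the loaded samples by `domain` then reproduces the five subsets described above.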
MME-RealWorld also features a Chinese language version, MME-RealWorld-CN, which includes images and questions focused on Chinese scenarios. This addresses limitations of existing benchmarks that are often translated from English, leading to potential misalignment between the question and the image.
The authors evaluated 28 prominent MLLMs on the benchmark, including GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet. Their results show that even the most advanced models fail to reach 60% accuracy, highlighting how challenging it remains to perceive high-resolution images and understand complex real-world scenarios, and underscoring the need for further research and development on MLLMs.
The data and evaluation code are available on the project page: https://mme-realworld.github.io/
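To make the accuracy figure concrete, here is a minimal sketch of a per-domain scoring loop, assuming the multiple-choice style of sample sketched earlier and predictions keyed by sample index; the names and interface are hypothetical and not taken from the released evaluation code.

```python
from collections import defaultdict


def per_domain_accuracy(samples, predictions):
    """Score predictions per domain.

    samples:     sequence of objects with .domain and .answer attributes
    predictions: dict mapping sample index -> predicted option letter
    This interface is a hypothetical sketch; the official script may differ.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for idx, sample in enumerate(samples):
        total[sample.domain] += 1
        if predictions.get(idx, "").strip().upper() == sample.answer.upper():
            correct[sample.domain] += 1
    return {domain: correct[domain] / total[domain] for domain in total}
```

Averaging these per-domain scores (or scoring across all samples at once) would yield a single headline number comparable to the sub-60% accuracies mentioned above.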