New Benchmark Reveals Multimodal AI Struggles with Basic Medical Image Perception

Researchers have developed MedBLINK, a new benchmark designed to test the fundamental visual perception abilities of multimodal AI models in medicine. The results are stark: even state-of-the-art models, including those specifically trained for healthcare, significantly underperform human experts on seemingly simple tasks.

A recent study introduces MedBLINK, a comprehensive benchmark aiming to assess how well multimodal language models (MLMs) can perceive basic visual information in medical images. The researchers emphasize that for AI to be trusted and adopted in clinical settings, it must reliably handle even the most intuitive visual cues, tasks that experienced clinicians perform almost reflexively.

MedBLINK comprises eight distinct tasks, covering a range of clinically relevant perceptual abilities; a sketch of how one such question might be represented in code follows the list. The tasks include:

  • Image Enhancement Detection: Identifying whether a CT scan has been enhanced with contrast agents, a distinction that is crucial for correct diagnostic interpretation.
  • Visual Depth Estimation: Determining the relative depth of objects within medical images, such as in endoscopy or ultrasound scans. For instance, in an endoscopy image showing colored dots, can the AI correctly identify which dot is deepest within the visual field?
  • Anatomical Orientation: Recognizing whether a medical image, such as a chest X-ray, is presented in its correct anatomical orientation or is upside down.
  • Histology Structure: Understanding the basic structural layers within microscopic tissue samples. For example, distinguishing between the epidermis and dermis in a skin biopsy.
  • Morphological Quantification: Accurately counting specific features, such as the number of wisdom teeth in a dental X-ray.
  • Relative Position: Comprehending the spatial relationships between different parts of the anatomy, for instance, determining which of two CT slices is closer to the pelvis.
  • Age Estimation: Identifying anatomical differences that indicate a patient’s age group from images like chest X-rays.
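
To make the benchmark's structure concrete, the sketch below shows how a single multiple-choice item might be represented. The field names and example values are illustrative assumptions for this summary, not the paper's released data format.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    """One MedBLINK-style multiple-choice question (illustrative fields)."""
    task: str           # e.g. "morphological_quantification"
    modality: str       # e.g. "xray", "ct", "endoscopy"
    image_path: str     # path to the expert-validated image
    question: str       # the question posed alongside the image
    options: list[str]  # candidate answers shown to the model
    answer: str         # the expert-validated correct option

# Hypothetical example item (values invented for illustration)
item = BenchmarkItem(
    task="morphological_quantification",
    modality="xray",
    image_path="images/dental_0042.png",
    question="How many wisdom teeth are visible in this dental X-ray?",
    options=["1", "2", "3", "4"],
    answer="4",
)
```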

The benchmark features over 1,400 multiple-choice questions derived from more than 1,600 expert-validated images across modalities including X-rays, CT scans, endoscopy, ultrasound, and histology. Nineteen leading MLMs were evaluated, including general-purpose models like GPT-4o and Claude 3.5 Sonnet, alongside medical-specific models such as Med-Flamingo and LLaVA-Med.
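
Scoring a model on such a benchmark reduces to comparing its chosen option against the expert-validated answer and aggregating accuracy per task. Here is a minimal sketch of that loop, assuming a hypothetical `ask_model` hook that poses one question to a model and returns the option it selects; this is not the authors' evaluation harness.

```python
from collections import defaultdict

def evaluate(model, items, ask_model):
    """Report per-task accuracy over multiple-choice items.

    ask_model(model, item) is a hypothetical hook returning the
    option string the model picks for one question.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        prediction = ask_model(model, item)
        total[item.task] += 1
        if prediction == item.answer:
            correct[item.task] += 1
    return {task: correct[task] / total[task] for task in total}
```

Reporting accuracy per task, rather than one overall score, is what lets a benchmark like this expose which specific perceptual abilities a model lacks.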

The findings are sobering. While human experts achieve an impressive 96.4% accuracy across the benchmark, the best-performing AI model managed only 65% accuracy. Many models struggled significantly with tasks like contrast detection and counting, with some performing at or below random chance. Even models with specialized medical training often underperformed general models, suggesting they might rely on superficial correlations rather than genuine visual understanding.
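
As rough context for those headline numbers, here is an illustrative back-of-the-envelope calculation (question counts are approximate and the four-option chance baseline is an assumption; per-task details are not given in this summary):

```python
# Illustrative arithmetic, not figures from the paper
n_questions = 1400
human_misses = round(n_questions * (1 - 0.964))  # ≈ 50 questions missed
model_misses = round(n_questions * (1 - 0.65))   # ≈ 490 questions missed
chance_baseline = 1 / 4  # 25% for a four-option multiple-choice question
```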

“This gap shows that many models lack fundamental visual grounding and therefore cannot yet be trusted for clinical use,” the researchers state. They highlight that improving these basic perceptual abilities is essential before AI systems can be reliably deployed in high-stakes medical decision-making scenarios. MedBLINK aims to provide a critical tool for identifying and addressing these shortcomings, ultimately guiding the development of more trustworthy and effective medical AI.