VERIFY: A New Benchmark Challenges AI's Visual Reasoning Capabilities
Rochester, NY – Recent advances in multimodal large language models (MLLMs) have shown impressive performance in tasks requiring both language and vision. However, a new study introduces VERIFY, a benchmark designed to specifically evaluate the visual reasoning abilities of these models, revealing significant limitations.
Traditional benchmarks often focus on recognition-based skills, such as object detection and image captioning. VERIFY, in contrast, requires models to reason primarily from visual information, minimizing reliance on prior knowledge or linguistic cues. For example, one VERIFY challenge presents a series of images and asks the model to identify the missing image that completes a pattern. The options might involve changes in shapes, positions, or other abstract visual properties. These tasks are similar to Raven’s Progressive Matrices, a test used to assess human abstract reasoning.
VERIFY goes beyond simple accuracy by providing human-annotated reasoning paths for each problem. This allows researchers to understand how models arrive at their decisions, uncovering biases and shortcomings in their reasoning processes.
The benchmark reveals that even the most advanced MLLMs struggle with visual reasoning. For example, when shown a grid of images where each row and column must contain specific shapes in a certain order, models often misinterpret the visual cues and fail to identify the correct missing image. In fact, testing on VERIFY shows that some LLMs perform worse than random chance (25%).
The authors of VERIFY propose novel metrics to assess visual reasoning fidelity beyond mere accuracy. These metrics capture the depth and quality of reasoning, highlighting critical imbalances in how models approach problems. The study identifies an issue called “Semantic Dominance Bias,” where models overly rely on recognizing letters or familiar objects rather than processing the shapes of the image.
“We found that models, when trying to recognize patterns, often got tripped up by letters that had to be understood not as ‘A, B, C…’ but more like a ‘straight line here, and a curve there’, but they were missing that” said one of the researchers.
The research team believes that VERIFY will help push the development of more robust and balanced MLLMs, promoting a holistic approach to both perception and reasoning. The dataset and code are available on the project website.
Chat about this paper
To chat about this paper, you'll need a free Gemini API key from Google AI Studio.
Your API key will be stored securely in your browser's local storage.