New Benchmark Reveals Critical Flaw in Advanced AI Reasoning: MLLMs Can’t “Draw to Think”
November 5, 2025 – Despite rapid advances in artificial intelligence, a new benchmark reveals that even the most powerful multimodal large language models (MLLMs) struggle fundamentally with complex reasoning problems that require visual imagination.
Researchers from ByteDance Seed, UNC-Chapel Hill, UC Santa Cruz, and Stanford have introduced MIRA (Multimodal Imagination for Reasoning Assessment), a challenging new suite of tasks designed to test a model’s ability to generate and utilize intermediate visual steps—like diagrams, structural sketches, or path drawings—a process humans naturally employ when they “draw to think.”
The results show a startling performance gap: leading MLLMs, including closed-source models like GPT-5 and Gemini 2.5 Pro, failed to surpass 20% accuracy when relying solely on their standard reasoning methods.
The Limits of Text-Only Thought
Traditional Chain-of-Thought (CoT) prompting has revolutionized how LLMs tackle multi-step problems by forcing them to articulate their internal logic in text. However, MIRA focuses on problems that are intrinsically visual, where language becomes an awkward and “lossy medium” for conveying necessary geometric, spatial, or physical manipulations.
For instance, one MIRA task requires tracking a die rolled along a complex path, demanding that the model compute the final face-up value or the sum of the hidden face values at each turn. A human would instinctively sketch or visualize the die's rotation. Current MLLMs, however, attempt to solve this by describing the visual changes step-by-step in words, a method that often leads to failure.
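The die-tracking task can be made concrete as a small state machine. The sketch below is illustrative rather than the paper's exact task format: it tracks the six face values of a standard die (opposite faces summing to 7, with an assumed initial orientation) as it tips one cell at a time in a compass direction.

```python
def roll(state, direction):
    """Tip a die one cell in a compass direction.

    state is a tuple (top, bottom, north, south, east, west) of face values.
    Rolling east, for example, tips the die over its east edge: the top face
    lands on the east side, the east face goes to the bottom, and so on.
    """
    t, b, n, s, e, w = state
    if direction == "E":
        return (w, e, n, s, t, b)
    if direction == "W":
        return (e, w, n, s, b, t)
    if direction == "N":
        return (s, n, t, b, e, w)
    if direction == "S":
        return (n, s, b, t, e, w)
    raise ValueError(f"unknown direction: {direction!r}")


# Assumed starting orientation: 1 on top, 6 on the bottom,
# 5 north, 2 south, 3 east, 4 west (opposite faces sum to 7).
state = (1, 6, 5, 2, 3, 4)
hidden_sum = 0
for d in "EEN":          # a hypothetical three-step path
    state = roll(state, d)
    hidden_sum += state[1]   # accumulate the hidden (bottom) face
```

After the path E, E, N from this starting orientation, the face-up value is 2. The point of the benchmark is that humans track exactly this kind of state by sketching, whereas models attempting the same bookkeeping in prose frequently lose track of an intermediate rotation.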
MIRA spans 20 task types across challenging domains, including Euclidean Geometry (such as determining the overlap area of two complex shapes) and Physics-Based Reasoning (like calculating the trajectory of a ball on a billiard table after multiple elastic bounces).
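The billiard-style sub-task has a well-known analytical shortcut: under ideal elastic reflection in an axis-aligned rectangle, each coordinate evolves independently, so the final position can be computed by "unfolding" the reflections with modular arithmetic. The snippet below is a minimal sketch of that trick (the table dimensions, names, and parameters are hypothetical, not taken from MIRA):

```python
def fold(u, length):
    """Map an unfolded coordinate back into [0, length] via mirror reflections."""
    u %= 2 * length
    return u if u <= length else 2 * length - u


def billiard_position(x0, y0, vx, vy, t, width, height):
    """Position of a point ball after time t on a frictionless table
    with perfectly elastic walls; each axis reflects independently."""
    return fold(x0 + vx * t, width), fold(y0 + vy * t, height)


# Example: a 4x3 table, ball starting at (1, 1) moving diagonally.
billiard_position(1, 1, 1, 1, 5, 4, 3)
```

With those numbers the ball hits the right wall at t = 3 and the top wall at t = 2, ending at (2, 0). A model that truly simulated the bounces, visually or otherwise, would recover the same answer; MIRA's finding is that text-only reasoning rarely does.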
Visual Clues Provide a 33% Boost
To diagnose precisely why models fail, the researchers evaluated MLLMs under a three-level protocol: direct input, text-only Chain-of-Thought, and a third, crucial level, Simulated Visual-CoT.
The Text-CoT setting, where models were prompted to verbalize their steps, proved ineffective—and even detrimental. For models like Gemini 2.5 Pro and o3, asking them to generate a text rationale actually degraded performance by up to 18.3%. This confirms that for visual tasks, textual reasoning alone is often counterproductive.
In stark contrast, the Simulated Visual-CoT setting delivered a massive improvement. By providing models with human-annotated “scratchpad” visual diagrams aligned with the correct reasoning trajectory, performance across all models and tasks jumped by an average relative gain of 33.7%.
In the convex hull geometry task, for example, GPT-5 failed when describing the coordinates and intersections verbally. When given a single image showing the two hulls and their precise overlapping area, the model correctly identified the answer, demonstrating that the failure was not in understanding the question, but in generating the necessary intermediate visual state.
The findings underscore a fundamental challenge: while today’s MLLMs are strong at visual perception and general reasoning, they lack the integrated ability to perform complex internal visual simulations. The study concludes that the future of advanced multimodal AI hinges on developing a unified paradigm capable of true “thinking while drawing”—generating high-quality intermediate visuals and tightly coupling them with subsequent textual reasoning.