Multimodal AI Models Flop on Complex Spatial Reasoning Tasks, New Benchmark Reveals
A new benchmark called MARBLE (Multimodal Reasoning Benchmark for Language Models) has exposed a significant weakness in today’s most advanced artificial intelligence systems: their ability to perform complex, multi-step spatial reasoning and planning. Despite impressive advancements in language and multimodal understanding, current AI models falter when faced with tasks requiring intricate step-by-step logic in visually and physically constrained environments.
The MARBLE benchmark, developed by researchers at EPFL and ETH Zurich, features two challenging tasks designed to push the boundaries of multimodal artificial intelligence. The first, M-PORTAL, is inspired by the popular video game Portal 2. It requires AI models to understand spatial layouts and plan sequences of actions involving portals, momentum, and environmental elements to solve puzzles. Imagine an AI needing to figure out how to place a portal that redirects a laser beam onto a switch, then use portal-driven momentum to fling an object across a gap – a task that demands not just visual recognition but a deep understanding of cause and effect in a 3D space.
The second task, M-CUBE, draws inspiration from the “Happy Cube” puzzle. Here, AI models are presented with a set of 3D jigsaw pieces and must assemble them into a perfect cube, ensuring that the interlocking bumps and gaps align seamlessly across all faces. This task tests the AI’s ability to mentally manipulate objects in three dimensions, consider multiple orientations, and crucially, understand how individual pieces fit together to form a coherent whole. Think of trying to assemble a complex 3D puzzle from just visual clues, where each piece has a unique shape and requires precise rotation.
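To make the task concrete, here is a small illustrative sketch in Python. It is not MARBLE's actual data format or evaluation code: the piece values, the `rotations` helper, and the simplified corner handling are assumptions made purely for illustration. It models a Happy-Cube-style piece as a 5x5 binary grid of bumps and gaps and checks whether two pieces can interlock along a shared edge.

```python
import numpy as np

# Illustrative sketch (assumed encoding, not MARBLE's actual data format):
# a Happy Cube piece modeled as a 5x5 binary grid -- 1 where material is
# present, 0 where there is a gap. The inner 3x3 block is always solid;
# only the 16 border cells vary from piece to piece.
piece_a = np.array([
    [0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [0, 0, 1, 0, 1],
])
piece_b = np.array([
    [1, 0, 1, 0, 1],
    [0, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1],
    [1, 1, 0, 1, 0],
])

def rotations(piece):
    """The four in-plane rotations of a piece (reflections omitted for brevity)."""
    return [np.rot90(piece, k) for k in range(4)]

def edges_interlock(edge_a, edge_b):
    """Two touching edges mesh if each non-corner cell is filled by exactly
    one of the two pieces. Corner cells, which involve a third piece on a
    real cube, are ignored in this simplified check."""
    return all(int(a) + int(b) == 1 for a, b in zip(edge_a[1:-1], edge_b[1:-1]))

# Usage: can the top edge of piece_a mesh with the bottom edge of piece_b
# in any of piece_b's rotations?
fits = any(edges_interlock(piece_a[0, :], rot[-1, :]) for rot in rotations(piece_b))
print("compatible orientation found:", fits)  # True for these example pieces
```

Solving the full M-CUBE task means finding a joint arrangement of six such pieces, across rotations and reflections, so that every one of the cube's twelve edges passes a check like this – a combinatorial search the models must effectively carry out in their heads.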
The results of evaluating 12 leading multimodal large language models (MLLMs) on MARBLE were stark. Across both M-PORTAL and M-CUBE, the models performed at or near random chance levels. For M-PORTAL’s “plan correctness” task, where models must judge if a proposed sequence of actions is valid, most achieved a mere 6% F1 score. Even on a simpler “fill-in-the-blanks” version of M-PORTAL, where models fill in missing steps in a plan, the best performer, GPT-o3, only managed 17.6% accuracy.
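The two scores can be pictured with a minimal sketch. The snippet below is not MARBLE's evaluation code; the labels, plans, and helper functions are made-up placeholders that simply illustrate F1 over binary plan-validity judgments and exact-match accuracy over completed plans.

```python
# Minimal sketch of the two scores described above; all data here is hypothetical.

def binary_f1(gold, pred):
    """F1 over binary valid/invalid judgments (1 = plan judged valid)."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Plan correctness: the model labels each candidate plan valid (1) or invalid (0).
gold_validity = [1, 0, 0, 1, 0, 0, 0, 1]   # hypothetical ground truth
pred_validity = [1, 1, 0, 0, 0, 1, 0, 0]   # hypothetical model judgments
print("plan-correctness F1:", round(binary_f1(gold_validity, pred_validity), 3))

# Fill-in-the-blanks: an item counts as correct only if every missing step
# is filled in exactly as in the reference plan.
def fill_in_accuracy(gold_plans, pred_plans):
    exact = sum(g == p for g, p in zip(gold_plans, pred_plans))
    return exact / len(gold_plans)

gold_plans = [["place portal A", "jump", "place portal B"],
              ["press switch", "redirect laser"]]
pred_plans = [["place portal A", "jump", "place portal B"],
              ["press switch", "jump"]]
print("fill-in accuracy:", fill_in_accuracy(gold_plans, pred_plans))  # 0.5
```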
The M-CUBE task proved even more formidable. All tested models achieved a dismal 0% accuracy on the full M-CUBE task, indicating a complete failure to assemble the cubes correctly. Even on a simplified version, CUBE-easy, which reduces complexity by simplifying pieces and offering partial solutions, only GPT-o3 achieved a noteworthy 72% accuracy, with most other models scoring significantly lower.
Researchers identified two primary bottlenecks: perception and reasoning. Even basic visual perception, such as converting a 3D jigsaw piece into a 2D array of bumps and gaps, proved challenging for MLLMs, with accuracy hovering around 70% for individual parts and dropping to 0% for the entire piece. This suggests that current models struggle to accurately extract and interpret crucial visual information, which is foundational for any spatial reasoning task.
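A rough back-of-the-envelope calculation shows why part-level accuracy of about 70% is consistent with near-total failure on whole pieces: if every element must be read correctly and errors are roughly independent, the per-piece success rate collapses multiplicatively. The numbers below (16 border cells per face, independent errors) are illustrative assumptions, not figures from the paper.

```python
# Back-of-the-envelope sketch: assumed numbers, not values reported in the paper.
part_accuracy = 0.70     # roughly the per-part perception accuracy cited above
parts_per_piece = 16     # assumed: border cells of a 5x5 Happy Cube face

# If every part must be transcribed correctly and errors are independent,
# whole-piece accuracy is the product of the per-part accuracies.
whole_piece_accuracy = part_accuracy ** parts_per_piece
print(f"expected whole-piece accuracy: {whole_piece_accuracy:.2%}")  # ~0.33%
```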
The benchmark’s findings highlight the need for AI systems that can go beyond simple pattern matching. True multimodal reasoning requires models to decompose complex problems into manageable steps, integrate information from various sources, and maintain a coherent understanding of physical and spatial constraints. MARBLE aims to drive progress in this critical area, pushing the development of AI that can navigate and interact with the real world with greater intelligence and adaptability.