New Benchmark MATHREAL Tackles Real-World Math Challenges for AI
Beijing, China - As artificial intelligence models, particularly Large Language Models (LLMs), demonstrate increasingly sophisticated capabilities in understanding and solving mathematical problems, a new benchmark called MATHREAL has been introduced to bridge the gap between lab-tested performance and real-world educational scenarios. Developed by researchers from Baidu Inc., Nanyang Technological University, Xiaopeng Motors, and Renmin University of China, MATHREAL aims to evaluate how well these multimodal AI systems perform when faced with the messy, authentic inputs that students actually encounter.
Existing benchmarks often rely on clean, digitally generated math problems. However, students typically interact with math questions through photos taken with mobile phones, often resulting in images with varying quality, perspective distortions, and extraneous information. MATHREAL addresses this by providing a dataset of 2,000 such “in-the-wild” math questions, meticulously collected and annotated from authentic K-12 educational contexts.
The dataset categorizes real-world image challenges into three main groups: image quality degradation (e.g., blur, glare), perspective variation (e.g., rotation, tilt), and irrelevant content interference (e.g., handwritten notes, shadows). These are further broken down into 14 subcategories, offering a granular assessment of how these imperfections affect AI performance. Each question is also classified by difficulty level and covers five core knowledge areas: geometry, algebra, statistics, logical reasoning, and function graphs.
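To make the taxonomy concrete, the sketch below shows how a single MATHREAL-style record might be represented in Python. The field names and category labels are illustrative assumptions drawn from the description above, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class MathRealItem:
    """Hypothetical record for one 'in-the-wild' math question.

    Field names are illustrative; the real MATHREAL schema may differ.
    """
    image_path: str           # photo of the question as a student captured it
    question_text: str        # reference transcription of the printed question
    answer: str               # ground-truth answer used for scoring
    knowledge_area: str       # geometry, algebra, statistics,
                              # logical reasoning, or function graphs
    difficulty: str           # e.g. "easy" / "medium" / "hard"
    degradation_group: str    # image quality degradation, perspective
                              # variation, or irrelevant content interference
    degradation_subtype: str  # one of the 14 finer-grained subcategories


# Example record (values invented for illustration):
item = MathRealItem(
    image_path="photos/q_0042.jpg",
    question_text="Find the area of the shaded triangle.",
    answer="12 cm^2",
    knowledge_area="geometry",
    difficulty="medium",
    degradation_group="image quality degradation",
    degradation_subtype="blur",
)
```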
To test the robustness of AI models, the researchers designed six experimental settings that progressively disentangle visual perception from reasoning. Their findings revealed a significant performance gap between models tested on clean images and those evaluated on realistic, noisy inputs. Even the best-performing model struggled, reaching only 53.9% accuracy on the real-world scenarios, a stark contrast to the near-human performance often reported on cleaner benchmarks.
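A minimal sketch of how per-setting accuracy and the clean-versus-noisy gap could be tallied is shown below; the setting labels and record layout are assumptions for illustration, not the paper's official evaluation protocol.

```python
from collections import defaultdict

def accuracy_by_setting(records):
    """Compute accuracy per experimental setting.

    Each record is assumed to be a dict like
    {"setting": "clean_image", "correct": True}; the setting names
    are placeholders rather than MATHREAL's actual six labels.
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["setting"]] += 1
        hits[r["setting"]] += int(r["correct"])
    return {s: hits[s] / totals[s] for s in totals}


# Toy example (values invented for illustration):
results = [
    {"setting": "clean_image", "correct": True},
    {"setting": "clean_image", "correct": True},
    {"setting": "real_photo", "correct": True},
    {"setting": "real_photo", "correct": False},
]
acc = accuracy_by_setting(results)
gap = acc["clean_image"] - acc["real_photo"]  # performance drop caused by noise
print(acc, gap)
```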
The study highlights that visual conditions like blur, rotation, and the presence of handwritten annotations significantly impair the reasoning abilities of current multimodal LLMs. This suggests that while these models may excel with controlled inputs, their visual perception components are fragile when exposed to the everyday complexities of educational materials.
For example, a question might appear on a crumpled piece of paper with a student’s handwritten annotations or smudged ink. An AI model needs to not only correctly interpret the printed text and diagrams but also filter out the irrelevant handwritten marks and compensate for any image distortions. MATHREAL provides a controlled environment to measure these capabilities.
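For readers who want a feel for the kinds of corruption involved, the snippet below uses Pillow to apply crude synthetic stand-ins for blur, tilt, and glare to a question image. This is only an approximation for experimentation; MATHREAL itself is built from genuine student photos rather than synthetically corrupted images.

```python
from PIL import Image, ImageEnhance, ImageFilter

def degrade(img: Image.Image,
            blur_radius: float = 2.0,
            tilt_degrees: float = 5.0,
            glare_factor: float = 1.4) -> Image.Image:
    """Apply rough synthetic analogues of real-world photo defects."""
    out = img.filter(ImageFilter.GaussianBlur(blur_radius))         # blur
    out = out.rotate(tilt_degrees, expand=True, fillcolor="white")  # tilt/rotation
    out = ImageEnhance.Brightness(out).enhance(glare_factor)        # glare-like brightening
    return out

# Usage (file name is illustrative):
# noisy = degrade(Image.open("question.jpg"))
# noisy.save("question_degraded.jpg")
```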
The researchers conducted extensive evaluations across 40 multimodal models, finding that closed-source models generally outperformed their open-source counterparts, especially under noisy conditions. However, even the leading models showed considerable room for improvement, particularly in tasks requiring robust visual understanding and consistent multi-step reasoning.
The MATHREAL benchmark is expected to drive further research into developing more resilient and adaptable multimodal AI systems, better equipping them to support students in authentic learning environments. The project also emphasizes the need for models that can reliably process optical character recognition (OCR) from noisy images and accurately understand visual elements, laying the groundwork for more practical AI applications in education.
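As a rough illustration of the OCR step the authors single out, the sketch below runs Tesseract (via the pytesseract wrapper) on a photographed question. In practice, multimodal LLMs perform this perception implicitly; this stand-alone pipeline is only an assumption-laden approximation of that stage.

```python
from PIL import Image
import pytesseract  # requires the Tesseract binary to be installed

def read_question(image_path: str) -> str:
    """Extract printed text from a photographed math question.

    Heavy blur, glare, or handwriting in the photo will typically corrupt
    this output, which is exactly the failure mode MATHREAL probes.
    """
    img = Image.open(image_path).convert("L")  # grayscale often helps OCR
    return pytesseract.image_to_string(img)

# print(read_question("question.jpg"))  # path is illustrative
```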