URSA: A New Model for Multimodal Mathematical Reasoning
A team of researchers from Tsinghua University and ByteDance have introduced URSA-7B, a new large language model (LLM) designed for multimodal mathematical reasoning. Detailed in a recent paper, URSA outperforms existing models by synthesizing high-quality training data and incorporating a novel process reward model for test-time scaling. This advancement significantly improves the accuracy and robustness of mathematical reasoning in LLMs, particularly in complex scenarios involving both text and images.
The Challenge of Multimodal Math Reasoning
Current LLMs excel at various tasks, but struggle with complex mathematical reasoning, especially when visual information is involved. This is largely due to the scarcity of high-quality training data that explicitly demonstrates the chain-of-thought (CoT) reasoning process—the step-by-step logical progression used to solve problems. The lack of this data limits models to simple pattern matching, preventing them from truly understanding and solving intricate problems.
URSA’s Three-Module Synthesis Strategy
To address this data limitation, the researchers developed a three-module synthesis strategy:
-
CoT Distillation: Existing datasets are often limited to simple question-answer pairs. URSA uses a powerful generator model (Gemini-1.5-Flash-002) to distill detailed CoT reasoning trajectories from these limited datasets. For example, instead of just getting the answer to a geometry problem, the model is prompted to provide a step-by-step solution explaining each geometric principle involved.
-
Trajectory-Format Rewriting: The generated CoT trajectories often lack uniformity in style and format. URSA’s system rewrites these trajectories to ensure consistency, making the training process more efficient. This helps the model to generalize better to unseen problem types.
-
Format Unification: The researchers collected various datasets from different sources, each with varying formatting conventions. URSA’s system unifies these disparate formats into a single, consistent structure, further enhancing training data quality.
These three modules result in a high-quality multimodal mathematical CoT reasoning dataset, MMathCoT-1M, containing approximately one million samples.
DualMath-1.1M and URSA-RM-7B: Test-Time Scaling
Even with high-quality training data, LLMs can still generate incorrect reasoning trajectories during inference. To mitigate this, URSA introduces a dual-view process supervision strategy and a verifier model, URSA-RM-7B. DualMath-1.1M, a dataset created via this strategy, uses Monte Carlo Tree Search to locate errors in URSA-7B’s reasoning trajectories and then synthesize new, corrected trajectories. URSA-RM-7B acts as a verifier, evaluating the correctness of URSA-7B’s inference process. This dual approach significantly enhances test-time scaling capabilities, allowing the model to efficiently select the most likely correct solution from among many possible solutions.
Impressive Results
Extensive benchmark testing on widely used datasets demonstrates URSA-7B’s superiority. It achieves state-of-the-art results on several multimodal mathematical reasoning benchmarks, surpassing both closed-source models (like GPT-4) and other open-source models in many categories. Furthermore, URSA-RM-7B demonstrates excellent out-of-distribution (OOD) generalization capabilities, showcasing its robustness.
Open-Sourcing
The model weights, training data, and code are all open-sourced, making this advancement accessible to the broader research community and enabling further improvements and applications in this rapidly evolving field.
Chat about this paper
To chat about this paper, you'll need a free Gemini API key from Google AI Studio.
Your API key will be stored securely in your browser's local storage.