New Model, VL-Cogito, Masters Complex Reasoning Through Progressive Curriculum
San Francisco, CA – August 1, 2025 – Researchers at DAMO Academy and Fudan University have unveiled VL-Cogito, a multimodal reasoning model designed to understand and answer complex, multi-faceted questions that involve both images and text. Its central contribution is a training method called Progressive Curriculum Reinforcement Learning (PCuRL), which guides the model through tasks of increasing difficulty, with the aim of more robust and efficient reasoning.
Traditional reinforcement learning (RL) has shown promise in improving the reasoning capabilities of large language models (LLMs). However, when applied to multimodal tasks, which combine different types of data like images and text, existing models often struggle with unstable performance due to the inherent complexity and varied nature of these problems. VL-Cogito aims to bridge this gap with its innovative PCuRL framework.
At its core, PCuRL employs two key strategies. First, an “online difficulty soft weighting” mechanism dynamically adjusts the training focus. This means the model is progressively exposed to more challenging tasks, analogous to how a student learns from easier concepts before tackling more advanced ones. For instance, a simple question like identifying the main subject in a clear image might be presented first, followed by more intricate problems like deciphering information from a complex chart or solving a multi-step geometry problem with accompanying diagrams.
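The idea of soft weighting by difficulty can be illustrated with a short Python sketch. It is not the paper's formula: the stage names, the target values, the Gaussian-shaped weight, and the function names `estimate_difficulty` and `soft_weight` are all assumptions made for illustration. Difficulty is estimated here as the fraction of sampled answers the model gets wrong, and each curriculum stage gives the most weight to prompts near its own target difficulty.

```python
import math

# Hypothetical targets: the difficulty level each curriculum stage emphasizes.
STAGE_TARGETS = {"easy": 0.2, "medium": 0.5, "hard": 0.8}


def estimate_difficulty(rollout_correct):
    """Estimate difficulty as 1 minus the fraction of correct sampled answers."""
    if not rollout_correct:
        raise ValueError("need at least one rollout")
    return 1.0 - sum(rollout_correct) / len(rollout_correct)


def soft_weight(difficulty, stage, width=0.25):
    """Return a weight in (0, 1] that peaks at the stage's target difficulty.

    The weight falls off smoothly rather than cutting prompts out entirely,
    which is what makes the weighting "soft". The Gaussian shape and the
    width are assumptions, not values taken from the paper.
    """
    target = STAGE_TARGETS[stage]
    return math.exp(-((difficulty - target) ** 2) / (2 * width ** 2))


if __name__ == "__main__":
    # A prompt the model answers correctly in 3 of 4 sampled attempts.
    d = estimate_difficulty([True, True, True, False])
    for stage in ("easy", "medium", "hard"):
        print(stage, round(soft_weight(d, stage), 3))
```

Because no prompt ever receives a weight of exactly zero, easier examples still contribute a little during later stages instead of disappearing from training altogether.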
Second, VL-Cogito incorporates a “dynamic length reward” mechanism. Unlike previous methods that might push models to produce uniformly long answers, this strategy encourages the model to adapt the length of its reasoning based on the complexity of the task. For simpler queries, a more concise explanation is rewarded, promoting efficiency. For more challenging problems, the model is incentivized to produce a more detailed, step-by-step reasoning process. This is akin to a student giving a brief summary for a straightforward question but elaborating with detailed explanations for a complex one.
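A length reward that adapts to the task can be sketched in a similar way. The code below is a minimal illustration under stated assumptions, not the reward defined in the paper: it assumes the target length for a prompt is the average length of the model's own correct answers to that prompt, and it uses a simple linear falloff around that target. The names `target_length` and `length_reward` and the default of 256 tokens are hypothetical.

```python
def target_length(lengths, correct, default=256.0):
    """Average length of the correct answers sampled for one prompt.

    Falls back to a default (an assumed value) when no answer was correct.
    """
    good = [n for n, ok in zip(lengths, correct) if ok]
    return sum(good) / len(good) if good else default


def length_reward(length, target):
    """Reward of 1.0 at the target length, falling linearly to 0.0.

    Answers that are much shorter or much longer than the target for this
    prompt both earn less, so short targets favor concise answers and long
    targets favor detailed ones.
    """
    if target <= 0:
        raise ValueError("target must be positive")
    return max(0.0, 1.0 - abs(length - target) / target)


if __name__ == "__main__":
    # Three sampled answers; the first and third were correct.
    t = target_length([100, 200, 300], [True, False, True])
    print(t, length_reward(200, t), length_reward(100, t))
```

Since the target is computed per prompt, a simple question whose correct answers are short will reward brevity, while a hard problem whose correct answers are long will reward longer reasoning chains.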
Experimental results show VL-Cogito outperforming or matching existing state-of-the-art models across a range of multimodal reasoning benchmarks, including those focused on mathematics, science, and general understanding. According to the paper, VL-Cogito achieves these results without a preliminary “cold-start” supervised fine-tuning phase, relying on the PCuRL framework directly.
The research team also conducted ablation studies that isolate the performance gain attributable to each component of PCuRL. According to these analyses, both the progressive curriculum and the dynamic length reward contribute significantly to the model’s reasoning ability, improving accuracy and efficiency. The findings suggest that this kind of curriculum learning strategy has substantial potential to advance multimodal reasoning models across a wider array of applications.