Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models
Recent advancements in large language models (LLMs) have introduced “slow-thinking” reasoning capabilities, allowing them to deliberate and revise their thought processes for more accurate outcomes. This paper explores how to imbue vision-language models (VLMs) with a similar “visual reflection” ability.
The researchers found that existing VLMs struggle with visual reflection: their attention to visual information diminishes rapidly as they generate longer responses. A VLM may therefore appear to be reasoning while it has in fact detached from the visual input, a form of “visual neglect” that leads to errors.
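A minimal sketch of how such a per-step visual attention ratio could be measured with a HuggingFace-style VLM; the image-token span indices (`img_start:img_end`), the choice of layer, and the exact metric definition are assumptions for illustration, not the paper’s implementation.

```python
def visual_attention_per_step(model, inputs, img_start, img_end, max_new_tokens=256):
    """Track how much attention each generated token pays to the image tokens.

    Assumes a decoder-style VLM whose image features occupy the contiguous
    input positions img_start:img_end, and a transformers-style generate()
    that can return per-step attentions.
    """
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        output_attentions=True,
        return_dict_in_generate=True,
    )
    ratios = []
    # out.attentions: one entry per generated token; each entry is a tuple of
    # per-layer tensors with shape [batch, heads, query_len, key_len].
    for step_attn in out.attentions:
        last_layer = step_attn[-1]          # final layer's attention map
        step = last_layer[0, :, -1, :]      # current token's attention, all heads
        visual_mass = step[:, img_start:img_end].sum(dim=-1)
        ratios.append((visual_mass / step.sum(dim=-1)).mean().item())
    # The paper's finding suggests these ratios decay for vanilla VLMs
    # but stay comparatively flat for Reflection-V.
    return ratios
```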
To address this, the paper proposes a novel two-stage training strategy called “Reflection-V.”
Stage 1: Cold-Start Initialization with Visual Reflection
Instead of relying solely on text-based reasoning derived from image descriptions, Reflection-V constructs a vision-centered reasoning dataset. This is achieved through an interactive process involving LLMs and VLMs: an LLM agent guides a VLM through a reasoning task, requiring the VLM to continuously access and utilize visual information. This data construction ensures that visual reflection patterns are learned from the outset.
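An illustrative sketch of what such an interaction loop could look like. The interfaces (`llm_agent`, `vlm`, the message format) are hypothetical stand-ins, not the paper’s pipeline; the point is only that the text-only agent must repeatedly query the VLM for visual evidence before answering, and the transcript is then rewritten into a single reasoning trace for cold-start training.

```python
def build_cold_start_example(question, image, llm_agent, vlm, max_turns=6):
    """Hypothetical Stage-1 data construction loop.

    llm_agent: text-only LLM that plans the reasoning and asks visual queries.
    vlm: vision-language model that answers queries directly from the image.
    """
    transcript = []
    for _ in range(max_turns):
        # The agent decides what visual evidence it still needs, or stops.
        agent_msg = llm_agent(question=question, history=transcript)
        transcript.append({"role": "agent", "text": agent_msg["text"]})
        if agent_msg["action"] == "answer":
            break
        # The VLM grounds the query in the image and reports what it sees.
        visual_reply = vlm(image=image, query=agent_msg["text"])
        transcript.append({"role": "vlm", "text": visual_reply})
    # Rewrite the multi-agent transcript into a first-person reasoning trace
    # that interleaves "look at the image" steps with deliberation.
    return llm_agent(task="rewrite_as_monologue", history=transcript)
```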
Stage 2: Reinforcement Learning with Visual Attention Reward
Following the initial training, a reinforcement learning (RL) stage is employed. Here, a reward mechanism specifically encourages the VLM to maintain attention to visual information throughout the reasoning process. This helps to reinforce the visual reflection capabilities learned in the first stage.
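A hedged sketch of how such a visual-attention-aware reward could be composed. The reward shape, the weighting `lam`, and the “tail of the generation” statistic are assumptions, not the paper’s exact formulation; the idea is simply to add a bonus when attention to image tokens stays high late in the response, where vanilla VLMs tend to drift away from the image.

```python
def reward(answer_correct, visual_attention_ratios, lam=0.5, tail_frac=0.5):
    """Combine a task-correctness reward with a visual-attention bonus.

    visual_attention_ratios: per-step ratios such as those returned by
    visual_attention_per_step() above (one value per generated token).
    """
    task_reward = 1.0 if answer_correct else 0.0
    # Average visual attention over the later portion of the generation.
    start = int(len(visual_attention_ratios) * (1 - tail_frac))
    tail = visual_attention_ratios[start:]
    attention_bonus = sum(tail) / max(len(tail), 1)
    return task_reward + lam * attention_bonus
```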
Key Findings and Results:
- Improved Performance: Reflection-V demonstrates significant improvements across multiple visual reasoning benchmarks, including those for math, multi-disciplinary, and general reasoning.
- Sustained Visual Reliance: Unlike existing models, Reflection-V maintains a stronger and more consistent reliance on visual information, even as more tokens are generated. This is evidenced by metrics such as visual attention weight and visual dependency.
- Reduced Hallucinations: The enhanced focus on visual information helps to suppress visual hallucinations, a common issue in VLMs.
- Effective Across Models: The proposed training strategy is effective across models of different scales.
Example of Visual Reflection:
The paper presents a case study where a VLM is asked to find the Fourier series for a sawtooth waveform. While a standard VLM tends to keep reasoning in text without returning to the figure, Reflection-V actively re-examines the image (“Let’s check the image again”) and uses visual cues to refine its reasoning, ultimately arriving at the correct answer. This “aha moment” is directly linked to its ability to visually verify information.
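For reference, the textbook expansion the model is expected to recover, assuming the standard sawtooth f(x) = x on (−π, π) extended with period 2π (the exact waveform shown in the paper’s figure may differ):

```latex
% Fourier series of the sawtooth f(x) = x on (-\pi, \pi), period 2\pi
f(x) \;=\; \sum_{n=1}^{\infty} \frac{2(-1)^{n+1}}{n}\,\sin(nx)
      \;=\; 2\left(\sin x - \frac{\sin 2x}{2} + \frac{\sin 3x}{3} - \cdots\right)
```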
In essence, Reflection-V aims to bridge the gap between text-only reasoning and the rich, visual world, enabling VLMs to truly “look again and think slowly” when tackling complex visual reasoning tasks.