InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
A new multi-modal reward model, InternLM-XComposer2.5-Reward (IXC-2.5-Reward), promises to significantly improve the performance of Large Vision Language Models (LVLMs). Developed by researchers at the Shanghai Artificial Intelligence Laboratory and several Chinese universities, the model addresses a critical gap in the field: the scarcity of publicly available, effective multi-modal reward models for LVLMs.
Existing reward models (RMs) primarily focus on text, leaving a significant gap in evaluating and improving the quality of LVLMs’ outputs for images and videos. IXC-2.5-Reward tackles this challenge by leveraging a high-quality multi-modal preference corpus containing text, image, and video data across diverse domains like instruction following, general understanding, mathematical reasoning, and video comprehension.
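As a rough illustration, a single preference record in such a corpus could look like the sketch below; the field names and structure are assumptions for exposition, not the paper's released data format.

```python
# Illustrative preference record; the field names are assumptions, not the
# actual schema of the IXC-2.5-Reward training data.
preference_example = {
    "domain": "general understanding",
    "prompt": "What is the person in the video doing?",
    "media": {"type": "video", "path": "clips/cooking_demo.mp4"},  # could also be an image, or text only
    "chosen": "The person is dicing an onion on a wooden cutting board.",
    "rejected": "The person is riding a bicycle through a park.",
}
```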
The model itself is surprisingly simple. Instead of a complex architecture, it augments an existing LVLM (InternLM-XComposer2.5) with a scoring head that predicts reward scores. This design allows it to evaluate inputs across various modalities and domains, a capability missing in many existing reward models.
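A minimal sketch of this design, assuming a Hugging Face-style backbone that returns per-token hidden states, might look like the following; the class name, the use of the final token's hidden state, and the Bradley-Terry-style pairwise loss are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LVLMWithScoringHead(nn.Module):
    """Illustrative sketch: a pretrained LVLM backbone plus a scalar scoring head."""

    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone  # assumed to accept interleaved text/image/video inputs
        self.score_head = nn.Linear(hidden_size, 1)  # maps a pooled hidden state to a scalar reward

    def forward(self, **inputs) -> torch.Tensor:
        hidden = self.backbone(**inputs).last_hidden_state  # (batch, seq_len, hidden_size)
        pooled = hidden[:, -1, :]                            # pool with the final token's representation
        return self.score_head(pooled).squeeze(-1)           # (batch,) reward scores


def pairwise_ranking_loss(chosen_scores: torch.Tensor,
                          rejected_scores: torch.Tensor) -> torch.Tensor:
    """Standard Bradley-Terry objective: push the chosen response above the rejected one."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```

Reward models of this kind are typically trained on preference pairs with exactly such a ranking loss, which is why a single linear head on top of the backbone is sufficient.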
Concrete Examples:
Imagine you have an LVLM that needs to answer a question about a video. A traditional, text-only RM might only assess the linguistic quality of the response. IXC-2.5-Reward, however, goes further: it can analyze the video itself, comparing its content against the generated response to check accuracy and coherence. This allows the RM to penalize incorrect answers more effectively.
Another example involves instruction following. A user provides an instruction with an image, say "Write a caption for this image of a cat playing with a ball of yarn." A traditional model might only judge the grammatical correctness and relevance of the generated caption. IXC-2.5-Reward, in contrast, also considers the image, checking that the caption matches what is actually depicted. If the image shows a cat playfully batting a ball of yarn but the caption describes a dog sleeping, IXC-2.5-Reward will assign it a lower score.
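The cat-and-yarn example can be made concrete with a small sketch; the `score` interface below is a hypothetical signature for a generic multi-modal reward model, not the model's actual API.

```python
from typing import Protocol


class MultiModalRewardModel(Protocol):
    """Hypothetical interface: score how well a response matches a prompt and an image."""

    def score(self, image: str, prompt: str, response: str) -> float: ...


def compare_captions(rm: MultiModalRewardModel) -> None:
    prompt = "Write a caption for this image of a cat playing with a ball of yarn."
    grounded = "A fluffy cat bats at a red ball of yarn on the carpet."
    ungrounded = "A dog sleeps peacefully on the sofa."

    # A text-only RM could rate both captions as fluent and relevant to the prompt;
    # a multi-modal RM also checks the image and should score the grounded caption higher.
    s_good = rm.score("cat_yarn.jpg", prompt, grounded)
    s_bad = rm.score("cat_yarn.jpg", prompt, ungrounded)
    print(f"grounded: {s_good:.2f}, ungrounded: {s_bad:.2f}")
```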
Key Applications:
The researchers demonstrated three primary applications of IXC-2.5-Reward:
- Reinforcement Learning (RL) Training: IXC-2.5-Reward provides a supervisory signal during RL training, leading to a significant improvement in an LVLM's ability to follow instructions and engage in multi-modal conversations. This resulted in a new chat model, IXC-2.5-Chat, which substantially outperforms previous open-source models on various benchmarks.
- Test-Time Scaling: By using IXC-2.5-Reward to select the best response from multiple candidate outputs generated by the LVLM (best-of-N sampling), the researchers achieved additional performance gains at inference time without any further training (see the sketch after this list).
- Data Cleaning: IXC-2.5-Reward can effectively identify and filter out noisy or outlier samples from existing image and video instruction-tuning datasets, improving the quality of the training data and, consequently, model performance (also sketched after this list).
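The last two applications reduce to the same operation: score candidates and keep the best, or only the sufficiently good, ones. The helpers below are a hedged sketch reusing the hypothetical `score(image, prompt, response)` interface from the earlier example; the function names and the threshold are assumptions for illustration.

```python
def best_of_n(rm, image: str, prompt: str, candidates: list[str]) -> str:
    """Test-time scaling: return the candidate response the reward model rates highest."""
    return max(candidates, key=lambda response: rm.score(image, prompt, response))


def filter_dataset(rm, samples: list[dict], threshold: float = 0.0) -> list[dict]:
    """Data cleaning: keep only instruction-tuning samples whose response scores above a threshold."""
    return [s for s in samples
            if rm.score(s["image"], s["prompt"], s["response"]) > threshold]
```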
Benchmark Results:
The paper highlights the model's superior performance compared to existing models. IXC-2.5-Reward achieved the best results on the VL-RewardBench multi-modal benchmark, surpassing even proprietary models such as Gemini-1.5-Pro and GPT-4o. It also performed competitively on text-only reward benchmarks, showing that its capabilities are not limited to multi-modal inputs.
The authors open-sourced the model weights and training recipes, making their work fully reproducible and promoting further research in the field of multi-modal reward models for LVLMs. This contribution is highly significant, pushing forward the development of more accurate, reliable, and versatile LVLMs that are capable of handling diverse types of inputs and tasks.