MIA-DPO: A New Approach to Visual Preference Alignment for Large Vision-Language Models
Large Vision-Language Models (LVLMs) are a hot topic in AI research. They can reason jointly over text and images, but they often struggle with multi-image tasks, where a prompt interleaves several images with captions or instructions. This is because LVLMs are typically trained on single-image datasets, which may not equip them with the skills needed to reason about complex visual scenes.
A team of researchers from Shanghai AI Laboratory, SJTU, and CUHK has developed a new technique, called MIA-DPO (Multi-Image Augmented Direct Preference Optimization), to address visual preference alignment for LVLMs in multi-image scenarios. The approach aims to help LVLMs make better decisions in complex visual environments by learning preferences over candidate responses in multi-image contexts.
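As background, MIA-DPO builds on the standard DPO objective; what changes is how the multi-image prompts and the chosen/rejected responses are constructed. The usual form of the objective is:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

where $x$ is the (multi-image) prompt, $y_w$ and $y_l$ are the chosen and rejected responses, $\pi_{\mathrm{ref}}$ is the frozen reference model, and $\beta$ controls the strength of the preference margin.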
MIA-DPO works by extending single-image training data with additional unrelated images arranged in collage or pic-in-pic formats. This diversifies the training data and pushes the LVLM to distinguish relevant from irrelevant images. MIA-DPO then uses the model's attention values to detect responses in which it focused on the wrong image, and treats those as the rejected responses in preference pairs; a sketch of this construction is shown below.
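The construction can be illustrated with a short Python sketch covering the two ingredients above: building a collage that mixes the original image with unrelated distractors, and using per-image attention mass to pick a chosen/rejected pair for DPO. The function names, the 2x2 grid, the tile size, and the 0.5 attention threshold are illustrative assumptions, not the paper's exact recipe; in the actual method the attention values come from the LVLM's own attention maps over image tokens, whereas here they are passed in as precomputed per-image sums.

```python
import random
from PIL import Image

def make_collage(target_img, distractor_imgs, tile=336):
    """Place the target image and up to three unrelated distractors in a 2x2 grid.
    Returns the collage and the grid position of the target image."""
    tagged = [("target", target_img)] + [("distractor", d) for d in distractor_imgs[:3]]
    random.shuffle(tagged)
    canvas = Image.new("RGB", (2 * tile, 2 * tile))
    target_index = 0
    for i, (tag, img) in enumerate(tagged):
        x, y = (i % 2) * tile, (i // 2) * tile
        canvas.paste(img.resize((tile, tile)), (x, y))
        if tag == "target":
            target_index = i
    return canvas, target_index

def attention_on_target(image_attn, target_index):
    """Fraction of a response's image attention that falls on the target image.
    `image_attn` is a list of per-image attention masses for one generated response."""
    return image_attn[target_index] / (sum(image_attn) + 1e-8)

def select_preference_pair(responses, image_attns, target_index, threshold=0.5):
    """Pick a chosen/rejected pair for DPO: one response that attends mostly to the
    target image vs. one that was distracted by the inserted images (hypothetical rule)."""
    chosen, rejected = None, None
    for resp, attn in zip(responses, image_attns):
        ratio = attention_on_target(attn, target_index)
        if ratio >= threshold and chosen is None:
            chosen = resp
        elif ratio < threshold and rejected is None:
            rejected = resp
    return chosen, rejected
```

Because the pairs are derived from the model's own attention behavior on augmented versions of existing data, no new human annotation is required.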
For instance, imagine a multi-image captioning task where an LVLM is asked to describe what is happening in a particular image. If the LVLM is confused by the other images in the collage, it may provide an inaccurate description. By using MIA-DPO, the LVLM can learn to pay attention to the correct image and focus on its key features, resulting in a more accurate caption.
MIA-DPO outperforms existing methods on five multi-image benchmarks, achieving an average performance boost of 3% on LLaVA-v1.5 and 4.3% on the recent InternLM-XC2.5. The researchers found that MIA-DPO also has a minimal effect on the model's ability to understand single images.
MIA-DPO is a promising approach to improving the performance of LVLMs in multi-image scenarios. It is a cost-effective method that can be easily implemented with existing training data. The researchers believe that MIA-DPO will be essential for developing LVLMs that can effectively handle complex visual information in real-world settings.