VisionReward: A Fine-Grained Approach to Human Preference Learning for Image and Video Generation
A team of researchers from Tsinghua University and Zhipu AI has unveiled VisionReward, a novel approach to aligning image and video generation models with human preferences. Their findings, detailed in a recent preprint, address critical limitations of existing reward models, leading to significant improvements in the quality and alignment of generated visual content.
Current methods for training visual generation models often rely on Reinforcement Learning from Human Feedback (RLHF), but these approaches face challenges. Existing reward models are often biased and hard to interpret, and they fail to capture the nuanced interplay of the many factors that shape human judgment. Evaluating video quality poses further difficulties: its dynamic nature demands more sophisticated assessment than static images require. Existing methods also struggle to prevent over-optimization or under-optimization of specific aspects of the generated content.
VisionReward tackles these issues head-on with a two-pronged strategy. First, it constructs a fine-grained, multi-dimensional reward model. Instead of producing a single overall score, VisionReward decomposes human preferences into multiple dimensions: for images, these include alignment, composition, quality, fidelity, safety, and emotion; for videos, the set extends to nine dimensions covering aspects such as motion dynamics, stability, and temporal speed. Each dimension is assessed through a series of interpretable binary yes/no questions, whose answers are linearly weighted and summed to produce the final score. In assessing image quality, for example, questions might include "Are the colors bright?" and "Is the lighting aesthetically pleasing?", and each "yes" answer contributes positively to the score.
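As a concrete illustration of this checklist-style scoring, the sketch below turns weighted yes/no judgments into per-dimension scores and a linear total. The questions, weights, and the `answer_question` judge function are hypothetical placeholders, not the authors' released implementation.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical (question, weight) pairs grouped by dimension; the real
# VisionReward checklist is larger and uses its own weighting.
CHECKLIST: Dict[str, List[Tuple[str, float]]] = {
    "quality": [
        ("Are the colors bright?", 1.0),
        ("Is the lighting aesthetically pleasing?", 1.0),
    ],
    "composition": [
        ("Is the main subject well framed?", 1.0),
    ],
}


def score_image(
    image,
    prompt: str,
    answer_question: Callable[[object, str, str], bool],  # judge model: True means "yes"
) -> Dict[str, float]:
    """Score one generated image: each "yes" adds its weight to that dimension."""
    scores: Dict[str, float] = {}
    for dimension, questions in CHECKLIST.items():
        scores[dimension] = sum(
            weight * float(answer_question(image, prompt, question))
            for question, weight in questions
        )
    # The final reward is the linear sum of all per-dimension contributions.
    scores["total"] = sum(scores.values())
    return scores
```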
The researchers demonstrate the effectiveness of this granular approach. They achieve a 17.2% improvement over existing video scoring methods like VideoScore in accurately predicting human preferences. This superior performance highlights VisionReward’s ability to capture subtle details often missed by simpler models.
Second, VisionReward introduces a Multi-Objective Preference Optimization (MPO) algorithm. Unlike methods that directly optimize a single reward score, which can overemphasize certain factors at the expense of others, MPO explicitly balances multiple dimensions. It does so by defining a "dominance" relationship between generated images or videos: if one sample's scores are superior to another's across all dimensions, it is deemed "dominant," and this relationship guides the optimization process. This prevents both over-optimization and under-optimization and leads to more balanced, holistic improvements across all dimensions.
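The dominance test at the heart of this pair selection can be sketched as follows. This is one plausible reading of the description above, with hypothetical function names rather than the authors' exact code.

```python
from itertools import permutations
from typing import Dict, Iterable, Tuple


def dominates(a: Dict[str, float], b: Dict[str, float]) -> bool:
    """True if scores `a` beat scores `b` on every dimension."""
    assert a.keys() == b.keys(), "samples must share the same score dimensions"
    return all(a[dim] > b[dim] for dim in a)


def preference_pairs(
    scored_samples: Iterable[Tuple[object, Dict[str, float]]]
) -> Iterable[Tuple[object, object]]:
    """Yield (preferred, rejected) pairs where the winner dominates the loser.

    Pairs with mixed outcomes (better on some dimensions, worse on others)
    are skipped, which is what discourages trading one dimension off
    against another during preference optimization.
    """
    samples = list(scored_samples)
    for (x, scores_x), (y, scores_y) in permutations(samples, 2):
        if dominates(scores_x, scores_y):
            yield x, y
```

The dominant/dominated pairs produced this way would then feed a standard preference-optimization objective, so the training signal only rewards samples that improve on every measured dimension at once.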
The effectiveness of VisionReward and MPO is validated through extensive experiments. The team conducted both automatic evaluations using machine metrics and human evaluations on a large dataset of 3 million image annotations and 2 million video annotations. The results show significant improvements over the baseline models, particularly for video generation.
The study’s contributions extend beyond the proposed methods. The authors also created a unified annotation system that supports fine-grained, multi-dimensional evaluation of both images and videos, and they provide a large-scale dataset of fine-grained human preference annotations. This dataset, along with the code, has been made publicly available, paving the way for further research and advancements in visual generation. The VisionReward approach represents a significant step toward more natural, better-aligned, and higher-quality images and videos that meet human expectations.