DenseDPO: Optimizing Video Diffusion Models with Fine-Grained Temporal Preferences
A new technique called Dense Direct Preference Optimization, or DenseDPO, aims to improve the quality of videos generated by AI diffusion models. Researchers at Snap Research and the University of Toronto found that existing methods struggle with generating dynamic, motion-rich videos, often favoring slow-motion clips due to biases in human preference data. DenseDPO tackles this problem by refining the way training data is created and leveraging segment-level human feedback.
Traditional methods ask annotators to pick between two entire videos generated from random noise. Because more dynamic clips are more prone to visual artifacts, people tend to prefer slower, artifact-free videos, and training on these preferences reinforces the model's tendency to produce slow-motion content.
DenseDPO introduces three key improvements:
- Data Construction: Instead of generating pairs from scratch, DenseDPO creates video pairs by corrupting copies of existing, high-quality videos with different levels of noise and then denoising them, a form of guided generation (see the first sketch after this list). Both videos in a pair therefore share similar motion patterns and high-level semantics but differ in local visual details. For example, to create variations of a ground-truth video of someone doing a handstand on a beach, each variation may show distortions in pixel fidelity while retaining the same action.
- Segment-Level Feedback: DenseDPO breaks videos into short, temporally aligned segments (e.g., roughly one-second clips), and annotators state a preference for each segment. This denser feedback yields more precise learning signals (see the segment-level loss sketch after this list). To illustrate, imagine two videos of someone skateboarding: in the first second, video A might be better because video B contains a visual glitch, while in the third second, video B might be better because video A's camera angle is awkward. DenseDPO captures these nuanced preferences over time, whereas standard DPO can only express a single overall judgement.
- Automatic Preference Annotation: The study also explores using Vision Language Models (VLMs), such as GPT, to predict segment-level preferences automatically (see the VLM query sketch after this list). Remarkably, with segment-level annotations, GPT approaches the accuracy of a video reward model fine-tuned specifically for this task, making DPO trained on VLM labels competitive with DPO trained on human labels.
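The data-construction step can be pictured as partially noising the same ground-truth clip to two different levels and letting the diffusion model denoise each copy. The sketch below is a minimal illustration of that idea, not the paper's implementation; `scheduler.add_noise`, `scheduler.num_steps`, and `model.denoise_from` are placeholder interfaces for a generic video diffusion sampler.

```python
import torch

def make_preference_pair(gt_video, model, scheduler, t_low=0.4, t_high=0.6):
    """Build a DenseDPO-style pair by partially noising one ground-truth clip.

    Both samples start from the same real video, so they keep its motion and
    semantics; the two corruption levels (t_low, t_high) produce different
    local visual details. All interfaces here are placeholders.
    """
    samples = []
    for t_frac in (t_low, t_high):
        t = int(t_frac * scheduler.num_steps)              # how far to corrupt
        noise = torch.randn_like(gt_video)
        noised = scheduler.add_noise(gt_video, noise, t)   # forward diffusion
        samples.append(model.denoise_from(noised, start_step=t))  # partial reverse pass
    return samples  # two temporally aligned clips derived from the same source
```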
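Segment-level preferences can then be turned into a per-segment DPO objective. The following is a schematic version that assumes per-segment log-likelihood scores are already available for both videos under the policy and a frozen reference model; the paper's actual objective is formulated on diffusion denoising terms, so this only conveys the structure of the loss.

```python
import torch
import torch.nn.functional as F

def dense_dpo_loss(logp_a, logp_b, ref_logp_a, ref_logp_b, prefs, beta=0.1):
    """Schematic segment-level DPO loss.

    logp_a, logp_b        : (num_segments,) per-segment scores of videos A and B
                            under the policy model.
    ref_logp_a, ref_logp_b: the same scores under the frozen reference model.
    prefs                 : (num_segments,) +1 if A's segment is preferred,
                            -1 if B's is preferred, 0 for ties (skipped).
    """
    margin_a = logp_a - ref_logp_a          # implicit reward of A per segment
    margin_b = logp_b - ref_logp_b          # implicit reward of B per segment
    diff = prefs * (margin_a - margin_b)    # sign flips where B is preferred

    mask = prefs != 0                       # drop tied segments
    # Standard DPO (Bradley-Terry) term applied independently to each segment.
    return -F.logsigmoid(beta * diff[mask]).mean()
```

Vanilla DPO corresponds to the special case where `prefs` collapses to a single label covering the whole video.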
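Automatic annotation with a VLM can be as simple as sending frames from two aligned segments and asking for a verdict. The snippet below is a hypothetical query using the OpenAI Python client with GPT-4o; the model choice, prompt wording, and frame sampling are assumptions, not the setup used in the paper.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def encode_frame(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def segment_preference(frames_a, frames_b):
    """Ask a VLM which aligned segment looks better; returns 'A', 'B', or 'tie'."""
    content = [{
        "type": "text",
        "text": ("Below are frames from segment A followed by frames from "
                 "segment B of two videos generated for the same prompt. "
                 "Which segment has better visual quality and more natural "
                 "motion? Answer with exactly one word: A, B, or tie."),
    }]
    for path in frames_a + frames_b:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{encode_frame(path)}"},
        })
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed model; the paper is summarized as using "GPT"
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content.strip()
```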
The research team evaluated DenseDPO on standard benchmarks (VideoJAM-bench and a custom MotionBench). They found that DenseDPO not only retains the motion strength of the base model but also matches or exceeds traditional DPO in text alignment, visual quality, and temporal consistency, while requiring significantly less labeled data. For example, DenseDPO trained on one third of the data used by vanilla DPO still matches it on every metric and scores significantly higher in dynamic degree.
The results suggest that DenseDPO can improve the quality and dynamism of AI-generated videos, bringing them closer to real-world complexity. The ability to automate preference annotation using VLMs further lowers the barrier to entry, enabling wider adoption and faster iteration in video diffusion model training.