New AI Framework Generates Truly Diverse Videos from Single Prompts
For all their rapid advancements, today’s text-to-video (T2V) diffusion models—capable of generating hyper-realistic, high-fidelity clips—suffer from a major flaw: a diversity deficit. Given a single text prompt, models often default to generating outputs that are highly similar, favoring a narrow distribution of styles, camera angles, or subject motions.
Researchers from Virginia Tech have introduced a novel solution, DPP-GRPO (Determinantal Point Process-Guided Policy Optimization), designed to explicitly optimize T2V systems for broad set-level diversity. The plug-and-play framework, which can be applied to existing open-source and black-box models like Wan, CogVideoX, and Veo, forces the generator to explore the full range of plausible visual and cinematic outcomes for any given input.
The core innovation of DPP-GRPO lies in its two-part reinforcement learning objective. First, it draws on determinantal point process (DPP) theory to implement a diminishing-returns reward mechanism. In simple terms, if the model generates a video that is too similar to one already produced for that prompt (say, a fourth close-up of a cat), its reward is sharply reduced. The system earns more reward for introducing a novel factor, such as a different scene layout or camera motion.
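The diminishing-returns idea can be illustrated with a small sketch. Here the marginal diversity gain of a candidate video is scored by how much it increases the log-determinant of a similarity kernel over the set's embeddings, a standard DPP quantity. This is a toy illustration under assumed cosine-similarity embeddings, not the paper's exact reward; the function and variable names are hypothetical.

```python
import numpy as np

def dpp_diversity_gain(existing, candidate, eps=1e-6):
    """Marginal diversity reward for adding `candidate` to the set of
    `existing` video embeddings, via the DPP log-determinant.
    A near-duplicate of an existing video yields little or negative gain
    (diminishing returns); a novel direction yields a higher gain."""
    def log_det_kernel(vectors):
        if not vectors:
            return 0.0
        # Row-normalize so the kernel is a cosine-similarity matrix.
        V = np.stack([v / np.linalg.norm(v) for v in vectors])
        K = V @ V.T + eps * np.eye(len(vectors))  # eps for stability
        _, logdet = np.linalg.slogdet(K)
        return logdet
    return log_det_kernel(existing + [candidate]) - log_det_kernel(existing)

# Two videos already generated for the prompt (toy 3-d embeddings):
a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0])
near_dup = np.array([0.99, 0.14, 0.0])  # almost a repeat of `a`
novel = np.array([0.0, 0.0, 1.0])       # a genuinely new direction

gain_dup = dpp_diversity_gain([a, b], near_dup)
gain_novel = dpp_diversity_gain([a, b], novel)
```

Because the determinant of a kernel matrix shrinks toward zero as its rows become linearly dependent, `gain_dup` comes out far below `gain_novel`: repeating an existing shot is penalized automatically, with no hand-tuned similarity threshold.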
Second, the framework employs Group Relative Policy Optimization (GRPO), which computes feedback over an entire batch of candidate videos. Instead of focusing solely on the quality of a single video, GRPO pushes the model toward producing a set of videos that, viewed together, jointly cover the widest possible semantic and cinematic range.
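The group-relative part of GRPO can be sketched in a few lines: each candidate's reward is normalized against the mean and standard deviation of its own group, so candidates are scored relative to their siblings rather than against a learned value baseline. This is a minimal sketch of that normalization step only (with hypothetical reward values), not the paper's full training loop.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each candidate's reward against
    its own group. Above-average candidates get positive advantages;
    below-average ones (e.g. near-duplicates with a low diversity
    reward) get negative advantages."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Combined quality + diversity rewards for four candidates of one
# prompt (hypothetical numbers):
adv = group_relative_advantages([0.9, 0.8, 0.3, 0.2])
```

Because the advantages are centered within the group, the policy gradient rewards the candidates that improve the set and pushes down the ones that merely repeat it, which is what drives the set-level optimization described above.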
This set-level optimization approach leads to dramatic increases in output variability across key video dimensions, including camera motion, visual appearance, and scene structure.
For example, when prompted with “A water lily rests on a calm pond,” a standard T2V model might repeatedly output variations of a close-up, soft-focus shot. In contrast, DPP-GRPO generates a diverse set that includes a minimalist vector style, a soft watercolor depiction, a photorealistic top-down aerial view, and a high-contrast shot with bold outlines—all while maintaining perfect prompt fidelity.
Similarly, a request like “A skateboarder performs jumps” yields videos that explore varied subject appearances, different environments (from a sunlit park plaza to a concrete skatepark at sunset), and different camera movements (a ground-level tracking shot versus a wide cinematic perspective). For a prompt like “A giraffe bending to sip water from a sunlit savanna pool,” the framework produces clips that vary across close-up angles, low-angle views, and different painterly color-graded finishes.
Quantitative evaluations across standard benchmarks like VBench confirmed that DPP-GRPO significantly outperforms existing baselines and prompt optimization methods in diversity metrics (TCE, TIE, and VENDI) while maintaining, or even improving, semantic fidelity and temporal coherence.
Crucially, the framework operates efficiently with minimal computational overhead (adding less than 1% to inference time), making it a practical, accessible tool. By treating diversity as a primary optimization goal, DPP-GRPO moves T2V generation beyond rote production, offering creators varied and imaginative cinematic choices from a single text input. The researchers have released a new benchmark dataset of 30,000 diverse prompts to further support research into this emerging area.