Precise and Efficient Control of Video Generation with TrackGo
đź“„ Full Paper
đź’¬ Ask
Researchers have recently made significant progress in using diffusion models to generate realistic videos. However, precisely controlling the content of these videos – for example, making a specific object move in a particular way – remains a challenge. A new paper introduces TrackGo, a method for controllable video generation that addresses this challenge by allowing users to specify target objects using free-form masks and desired motion trajectories using arrows.
“Existing methods often struggle to achieve precise control over object movement and scene transformations in generated videos,” says Haitao Zhou, one of the paper’s authors. “For instance, some approaches use bounding boxes to specify the target area, but these often encompass redundant regions that interfere with the motion of the target and disrupt the coherence of the background.” TrackGo overcomes these limitations by allowing users to select target areas with a brush and then indicate the movement trajectory with an arrow, providing more precise control.
TrackGo also introduces a new component called TrackAdapter, which is a lightweight adapter designed to be seamlessly integrated into the temporal self-attention layers of a pretrained video generation model. The researchers observed that the attention maps generated by these layers effectively highlight areas of motion within a video. TrackAdapter leverages this by directly manipulating the attention map to achieve precise motion control, a process that minimizes additional computational overhead.
To demonstrate the effectiveness of their approach, the researchers conducted a series of experiments, comparing TrackGo to existing methods like DragAnything and DragNUWA. Their results show that TrackGo significantly outperforms these approaches across all metrics, including video quality (measured using FVD and FID) and motion faithfulness (measured using ObjMC).
The researchers also conducted a user study, asking participants to select the best-looking video from a set of three generated by TrackGo, DragAnything, and DragNUWA. The results showed that TrackGo achieved 62% of the votes, significantly higher than DragAnything’s 16.33% and DragNUWA’s 21.67%.
TrackGo represents a significant advance in the field of controllable video generation, offering a flexible and efficient approach for manipulating video content. The researchers believe that TrackGo has the potential to be used in a wide range of applications, including film production, cartoon creation, and animation.
Chat about this paper
To chat about this paper, you'll need a free Gemini API key from Google AI Studio.
Your API key will be stored securely in your browser's local storage.