DreamWaltz-G: Generating Expressive 3D Avatars From Text

While 2D diffusion models have shown promise for generating 3D avatars, creating high-quality 3D avatars that can be convincingly animated remains challenging. This is due to the intricate structures of 3D humans, such as hands and faces, as well as the need to keep appearance and pose consistent across different views. Existing text-to-3D methods struggle to capture these details and often produce artifacts such as blurry images or multiple faces.

To address these shortcomings, a team of researchers from the University of Hong Kong, Tencent, and the University of Science and Technology of China has developed a novel learning framework, called DreamWaltz-G, that uses skeleton-guided score distillation (SkelSD) and Hybrid 3D Gaussian Avatars (H3GA) to generate high-quality 3D avatars that can be expressively animated.

Skeleton-Guided Score Distillation

SkelSD incorporates skeleton controls from 3D human templates into 2D diffusion models. This integration makes the score-distillation supervision consistent across different views and poses, which helps to generate more realistic and stable 3D avatars. For example, with skeletal guidance the model can better distinguish a person's front and back views, reducing the likelihood of artifacts such as multiple faces or blurry images.
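
To make the idea concrete, below is a minimal PyTorch sketch of one skeleton-conditioned score-distillation step. The `noise_pred_fn` and `encode_fn` callables stand in for a skeleton-conditioned diffusion model (e.g., a ControlNet-style branch) and a latent encoder; all names and signatures here are illustrative assumptions, not the authors' actual code.

```python
import torch
import torch.nn.functional as F

def skelsd_step(
    rendered_rgb: torch.Tensor,    # (B, 3, H, W) differentiable render of the avatar
    skeleton_map: torch.Tensor,    # (B, 3, H, W) skeleton image from the same camera and pose
    text_embeds: torch.Tensor,     # (B, L, D) prompt embeddings
    noise_pred_fn,                 # (noisy_latents, t, text_embeds, skeleton_map) -> predicted noise
    encode_fn,                     # images -> latents (e.g., a VAE encoder)
    alphas_cumprod: torch.Tensor,  # (T,) cumulative diffusion schedule
) -> torch.Tensor:
    """One score-distillation step with the diffusion prior conditioned on a skeleton.

    Conditioning the noise prediction on the rendered skeleton ties the 2D
    guidance to the avatar's current camera view and body pose, which is the
    core idea of SkelSD.
    """
    latents = encode_fn(rendered_rgb)
    t = torch.randint(20, 980, (latents.shape[0],), device=latents.device)
    noise = torch.randn_like(latents)
    alpha_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy = alpha_t.sqrt() * latents + (1.0 - alpha_t).sqrt() * noise

    with torch.no_grad():  # the diffusion prior is frozen; only the 3D model is updated
        eps_pred = noise_pred_fn(noisy, t, text_embeds, skeleton_map)

    # Standard SDS trick: an MSE against a detached target whose gradient
    # with respect to the latents equals w * (eps_pred - noise).
    grad = (1.0 - alpha_t) * (eps_pred - noise)
    target = (latents - grad).detach()
    return 0.5 * F.mse_loss(latents, target, reduction="sum")
```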

Hybrid 3D Gaussian Avatars

H3GA combines the efficiency of 3D Gaussian splatting, the local continuity of neural implicit fields, and the geometric accuracy of parameterized meshes. This hybrid representation enables stable optimization and expressive animation. For instance, H3GA allows for real-time rendering of 3D avatars, which is crucial for applications such as video games or interactive media. Moreover, it enables expressive animation with finger movements and facial expressions, adding a new level of realism to the generated avatars.
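
As a rough illustration of the hybrid idea, the sketch below stores Gaussians in a canonical space and drives them with linear blend skinning from a body template's joints. The field layout and the skinning scheme are assumptions for illustration, not the paper's exact parameterization.

```python
from dataclasses import dataclass
import torch

@dataclass
class HybridGaussianAvatar:
    """3D Gaussians stored in a canonical space and driven by a body template."""
    means: torch.Tensor         # (N, 3) canonical Gaussian centers
    scales: torch.Tensor        # (N, 3) per-axis scales
    rotations: torch.Tensor     # (N, 4) unit quaternions
    opacities: torch.Tensor     # (N, 1)
    colors: torch.Tensor        # (N, 3)
    skin_weights: torch.Tensor  # (N, J) blend weights copied from nearby template vertices

    def pose_means(self, joint_transforms: torch.Tensor) -> torch.Tensor:
        """Warp canonical centers to a target pose with linear blend skinning.

        joint_transforms: (J, 4, 4) rigid transforms of the template's joints
        (e.g., from SMPL-X). Returns posed centers of shape (N, 3).
        """
        # Blend per-joint transforms by the skinning weights: (N, 4, 4).
        blended = torch.einsum("nj,jab->nab", self.skin_weights, joint_transforms)
        homo = torch.cat([self.means, torch.ones_like(self.means[:, :1])], dim=-1)
        return torch.einsum("nab,nb->na", blended, homo)[:, :3]
```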

DreamWaltz-G: A Framework for Animatable 3D Avatars

DreamWaltz-G consists of two training stages (a high-level code sketch follows the list):

  1. Canonical Avatar Generation: The first stage generates a canonical 3D avatar in a static pose from a text prompt. A neural radiance field (NeRF) is trained with skeleton-guided score distillation, leveraging the SMPL-X 3D human body model to provide geometry constraints and to render skeleton images that keep supervision consistent across views.

  2. Animatable Avatar Learning: The second stage takes the canonical avatar from the first stage and learns to animate it expressively. The hybrid 3D Gaussian avatar representation is trained with skeleton-guided score distillation while the poses and expressions of the SMPL-X model are randomized.
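
Putting the two stages together, the control flow might look like the sketch below. Every object here (the renderers, the SMPL-X pose sampler, the SDS loss) is a named placeholder for a component the paper describes, not an actual API.

```python
import torch

def train_dreamwaltz_g(nerf, avatar, smplx, sds_loss_fn, steps=(10_000, 10_000)):
    """Two-stage loop: canonical avatar first, then animatable avatar.

    `nerf` and `avatar` are assumed to expose a differentiable
    `render(camera, pose)`; `smplx` is assumed to sample cameras and poses and
    to render skeleton images. All of these are illustrative placeholders.
    """
    # Stage 1: canonical avatar generation in a fixed canonical pose.
    opt = torch.optim.Adam(nerf.parameters(), lr=1e-3)
    for _ in range(steps[0]):
        camera = smplx.sample_camera()
        skeleton = smplx.render_skeleton(camera, pose=smplx.canonical_pose)
        rgb = nerf.render(camera, pose=smplx.canonical_pose)
        loss = sds_loss_fn(rgb, skeleton)  # skeleton-guided score distillation
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Stage 2: animatable avatar learning under randomized poses and expressions.
    avatar.init_from(nerf)  # distill stage-1 geometry and appearance
    opt = torch.optim.Adam(avatar.parameters(), lr=1e-3)
    for _ in range(steps[1]):
        camera = smplx.sample_camera()
        pose = smplx.sample_pose_and_expression()  # body, fingers, and face
        skeleton = smplx.render_skeleton(camera, pose=pose)
        rgb = avatar.render(camera, pose=pose)
        loss = sds_loss_fn(rgb, skeleton)
        opt.zero_grad()
        loss.backward()
        opt.step()
```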

Applications

The researchers demonstrate the effectiveness of DreamWaltz-G in various downstream applications, including human video reenactment and multi-subject scene composition.

Conclusion

DreamWaltz-G represents a significant advancement in text-driven 3D avatar generation. It combines the strengths of skeleton-guided score distillation and a hybrid 3D Gaussian avatar representation to generate high-quality, animatable avatars with a wide range of applications. This work paves the way for more realistic and interactive virtual environments, enabling a new era of digital content creation.