Diffusion Models Can Now Build Scenes From 3D Layouts

Imagine you want to create a stunning digital image of a living room. You can describe it with words: “a cozy room with a white sofa, a floor lamp, and a plant pot”. But what if you could also place these objects in 3D space, just like a real-world interior designer?

That’s exactly what a new research paper, “Build-A-Scene: Interactive 3D Layout Control for Diffusion-Based Image Generation”, makes possible. It presents a method that gives users fine-grained control over where objects are placed in images generated by diffusion models, the class of AI systems that create high-quality, realistic images from text prompts.

Previous attempts to control the placement and arrangement of objects in diffusion models have been limited to 2D layouts: users could only specify the position and size of objects on a flat plane, with no sense of depth. These approaches also suffered from a major drawback: they couldn’t guarantee that the generated image would remain consistent if the user changed the layout.

Build-A-Scene removes both limitations by working with 3D bounding boxes instead. By placing objects inside 3D boxes, users can control their position, size, and orientation in 3D space. The system keeps the generated image consistent as the user moves, adds, removes, or changes objects, without compromising the overall quality and style of the scene.
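
To make the 3D-box idea concrete, here is one way such a layout could be written down as data: each object is a text prompt attached to a box with a position, a size, and an orientation. This is a minimal sketch for illustration; the class and field names are assumptions made for this article, not the paper’s actual interface.

```python
from dataclasses import dataclass

# Hypothetical data classes for illustration; not the paper's actual API.
@dataclass
class Box3D:
    prompt: str                            # text describing the object, e.g. "a white sofa"
    center: tuple[float, float, float]     # (x, y, z) position in scene coordinates
    size: tuple[float, float, float]       # (width, height, depth) of the box
    yaw_degrees: float = 0.0               # rotation about the vertical axis

@dataclass
class SceneLayout:
    background_prompt: str
    boxes: list[Box3D]

# The living-room example from the introduction, laid out in 3D:
layout = SceneLayout(
    background_prompt="a cozy living room",
    boxes=[
        Box3D("a white sofa", center=(0.0, 0.0, 2.5), size=(2.0, 0.9, 0.9)),
        Box3D("a floor lamp", center=(1.4, 0.0, 2.0), size=(0.4, 1.6, 0.4)),
        Box3D("a plant pot", center=(-1.2, 0.0, 1.8), size=(0.5, 0.7, 0.5)),
    ],
)
```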

“We propose a diffusion-based approach for Text-to-Image (T2I) generation with interactive 3D layout control,” the researchers write. “To this end, we leverage the recent advancements in depth-conditioned T2I models and propose a novel approach for interactive 3D layout control. We replace the traditional 2D boxes used in layout control with 3D boxes. Furthermore, we revamp the T2I task as a multi-stage generation process, where at each stage, the user can insert, change, and move an object in 3D while preserving objects from earlier stages.”
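
Read as a workflow, that multi-stage process might look roughly like the sketch below. The helper functions `render_depth` and `generate_image` are hypothetical stand-ins for a depth-conditioned diffusion pipeline, and the loop structure is an assumption made for illustration rather than the authors’ implementation.

```python
# Hypothetical sketch of the multi-stage workflow; render_depth and
# generate_image stand in for a depth-conditioned diffusion pipeline.

def render_depth(layout):
    """Rasterize the layout's 3D boxes into a depth map for conditioning."""
    ...

def generate_image(prompt, depth_map, previous_image=None):
    """Run a depth-conditioned T2I model, reusing earlier content when given."""
    ...

def build_scene_interactively(layout, edits):
    """Apply one user edit per stage while preserving objects from earlier stages."""
    image = generate_image(layout.background_prompt, render_depth(layout))
    for edit in edits:                    # each edit inserts, moves, or changes one box
        layout = edit(layout)             # update the 3D layout
        depth = render_depth(layout)      # re-render the depth conditioning
        # Conditioning on the previous image is how earlier objects are kept intact.
        image = generate_image(layout.background_prompt, depth, previous_image=image)
    return image
```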

To achieve this, the authors introduce a new module called Dynamic Self-Attention (DSA), which enables seamless integration of new objects into a scene while preserving the existing ones. They also propose a consistent 3D object translation strategy, which guarantees that objects maintain their identity when moved or scaled.
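
The summary does not spell out how DSA is computed, but a common pattern for preserving existing content in diffusion models is to let the current denoising step also attend to keys and values cached from an earlier result. The snippet below is a minimal, generic sketch of that pattern in PyTorch; the function name and details are assumptions and may differ from the paper’s actual DSA module.

```python
import torch

def self_attention_with_reference(q, k, v, k_ref=None, v_ref=None):
    """Self-attention that can also attend to features from a previous stage.

    q, k, v:       (batch, tokens, dim) features of the current generation step.
    k_ref, v_ref:  optional (batch, tokens, dim) features cached from an earlier
                   stage; attending to them pulls existing objects into the new
                   image. Generic sketch only, not the paper's exact DSA module.
    """
    if k_ref is not None:
        k = torch.cat([k, k_ref], dim=1)   # extend the keys with reference tokens
        v = torch.cat([v, v_ref], dim=1)   # and the corresponding values
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    weights = scores.softmax(dim=-1)
    return weights @ v

# Toy usage with random features:
q = k = v = torch.randn(1, 16, 64)
k_prev = v_prev = torch.randn(1, 16, 64)
out = self_attention_with_reference(q, k, v, k_ref=k_prev, v_ref=v_prev)
print(out.shape)  # torch.Size([1, 16, 64])
```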

“Experiments show that our approach can generate complicated scenes based on 3D layouts, boosting the object generation success rate over the standard depth-conditioned T2I methods by 2x,” the authors report. “Moreover, it outperforms other methods in comparison in preserving objects under layout changes.”

Build-A-Scene is a significant step forward for interactive image generation tools. It opens new possibilities for creative professionals, including interior designers, architects, and artists, who can now use AI to quickly and easily visualize their ideas in 3D space.