Scientists Unleash a Diffusion Model that Can Generate 16K Images on a Single GPU in Just One Minute
đź“„ Full Paper
đź’¬ Ask
Diffusion models, a type of artificial intelligence system, have rapidly become the leading technology for creating stunning images. They produce impressive, intricately detailed results by iteratively refining random noise into an image. However, generating high-resolution images with existing diffusion models presents significant challenges: these models rely heavily on self-attention to capture complex spatial relationships, and self-attention has quadratic time and memory complexity with respect to the number of spatial tokens.
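To see where the quadratic cost comes from, here is a minimal sketch of standard scaled dot-product self-attention; the PyTorch framing and tensor shapes are illustrative assumptions, not taken from the paper:

```python
import torch

def quadratic_self_attention(q, k, v):
    # q, k, v: (batch, n_tokens, dim). The attention matrix is
    # (n_tokens x n_tokens), so time and memory grow quadratically
    # with the number of spatial tokens.
    scores = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return scores @ v

# At 16K resolution the latent grid contains millions of tokens, so the
# n_tokens x n_tokens attention matrix no longer fits in GPU memory.
```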
In a new paper titled “LINFUSION: 1 GPU, 1 MINUTE, 16K IMAGE,” researchers from the National University of Singapore propose an innovative solution to this problem. Their model, LinFusion, introduces a generalized linear attention mechanism that can achieve performance on par with, or even superior to, the original Stable Diffusion model, while significantly reducing time and memory complexity. In short, LinFusion allows for the generation of larger, more detailed images than ever before.
LinFusion takes inspiration from recent advances in linear-complexity token mixers, which have shown promise in sequential generation tasks. The researchers began by investigating how well these models transfer to diffusion, focusing on Mamba and Mamba2. They identified two key drawbacks: a distribution shift when the inference resolution differs from the training resolution, and a causal restriction that is unnecessary, and even counterproductive, for diffusion models.
To address these drawbacks, the researchers developed a generalized linear attention paradigm. This paradigm incorporates two key features: attention normalization and non-causal inference. Attention normalization ensures that the total impact of all tokens remains consistent regardless of the input scale, addressing the distribution shift problem. Non-causal inference allows the model to simultaneously access all noisy spatial tokens, removing the unnecessary causal restriction imposed by the Mamba models.
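The sketch below shows one common way a normalized, non-causal linear attention can be written; the feature map (elu + 1), the exact normalization, and the shapes are assumptions for illustration and may differ from LinFusion's actual formulation:

```python
import torch
import torch.nn.functional as F

def linear_attention_noncausal(q, k, v):
    # Illustrative feature map: phi(x) = elu(x) + 1 keeps features
    # positive, a common choice in linear-attention variants.
    phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1

    # Non-causal: aggregate over *all* tokens at once, with no causal
    # mask. The key-value summary `kv` has a fixed size independent of
    # the token count, so cost grows linearly with resolution.
    kv = phi_k.transpose(-2, -1) @ v          # (batch, dim, dim_v)
    z = phi_k.sum(dim=-2, keepdim=True)       # (batch, 1, dim)

    # Attention normalization: dividing by each query's total similarity
    # keeps the combined contribution of all tokens consistent as the
    # number of tokens (i.e., the resolution) changes.
    return (phi_q @ kv) / (phi_q * z).sum(dim=-1, keepdim=True)

# Example usage with an assumed 64x64 latent grid (4096 tokens):
q = k = v = torch.randn(1, 4096, 64)
out = linear_attention_noncausal(q, k, v)     # (1, 4096, 64)
```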
To further improve performance and reduce training costs, the researchers decided to distill knowledge from a pre-trained Stable Diffusion model into LinFusion. This approach allows the researchers to leverage the vast knowledge learned by the pre-trained model, while only needing to train the linear attention modules for 50k iterations. The resulting model delivers performance that is on par with, or superior to, the original SD model, while using significantly less time and memory.
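A distillation step of this kind might look roughly like the sketch below; the function names, call signatures, and the simple MSE objective are hypothetical placeholders rather than the paper's exact training recipe:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_unet, teacher_unet, noisy_latents, t, text_emb):
    # `teacher_unet` stands in for the frozen pre-trained SD UNet;
    # `student_unet` for the same UNet with its self-attention layers
    # replaced by linear-attention modules. In a setup like this, only
    # those modules' parameters would be optimized.
    with torch.no_grad():
        target = teacher_unet(noisy_latents, t, text_emb)
    pred = student_unet(noisy_latents, t, text_emb)
    # Match the student's prediction to the teacher's output.
    return F.mse_loss(pred, target)
```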
The paper includes extensive experiments demonstrating the effectiveness of LinFusion. These show that LinFusion achieves satisfactory zero-shot cross-resolution generation, producing images at resolutions as high as 16K. LinFusion is also highly compatible with pre-trained SD components such as ControlNet and IP-Adapter, requiring no additional adaptation.
LinFusion represents a significant advancement in the field of diffusion models. It offers a highly efficient and scalable solution for generating high-resolution images, pushing the boundaries of what is possible with this technology. The researchers believe that this work will contribute to a new era of AI-generated content where even more stunning and realistic visuals are readily available.