IterComp: A New Framework for Compositional Text-to-Image Generation

Diffusion models are revolutionizing text-to-image generation. Models such as Stable Diffusion 3 and FLUX excel at producing realistic, diverse images, yet they struggle with complex prompts that describe multiple objects and the relationships between them. The difficulty lies in compositional generation, which spans several aspects: attribute binding, spatial relationships, and non-spatial relationships.

To overcome these limitations, a team of researchers from Tsinghua University, Peking University, LibAI Lab, and USTC has proposed a new framework called IterComp. The framework aims to improve the compositional capabilities of diffusion models by leveraging the complementary strengths of multiple models.

IterComp starts by curating a gallery of six powerful open-source diffusion models, each of which excels at different aspects of compositional generation: some bind attributes to the right objects more accurately, while others better capture spatial relationships. By ranking these models' outputs on attribute binding, spatial relationships, and non-spatial relationships, the researchers built a composition-aware model preference dataset consisting of numerous image-rank pairs that record each model's relative strength on different compositional tasks.
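To make the idea concrete, here is a minimal sketch of how such image-rank pairs could be organized. The field names and the `build_pairs` helper are illustrative assumptions, not the paper's actual data schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative schema for composition-aware preference data (not the paper's exact format).
@dataclass
class PreferencePair:
    prompt: str          # the compositional text prompt
    category: str        # "attribute_binding", "spatial", or "non_spatial"
    winner_image: str    # path to the image ranked higher for this category
    loser_image: str     # path to the image ranked lower
    winner_model: str    # which gallery model produced the preferred image
    loser_model: str     # which gallery model produced the other image

def build_pairs(prompt: str, category: str,
                ranked: List[Tuple[str, str]]) -> List[PreferencePair]:
    """Expand a ranked list of (model_name, image_path), best first, into
    pairwise preferences: each image beats every image ranked below it."""
    pairs = []
    for i, (win_model, win_img) in enumerate(ranked):
        for lose_model, lose_img in ranked[i + 1:]:
            pairs.append(PreferencePair(prompt, category,
                                        win_img, lose_img,
                                        win_model, lose_model))
    return pairs
```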

The research team then trained composition-aware reward models using this dataset. These models learn to provide fine-grained guidance during the fine-tuning of the base diffusion model, helping to improve its compositional generation abilities.
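A common way to train a reward model on ranked image pairs is a Bradley-Terry style pairwise loss, which pushes the preferred image's score above the other's. The sketch below illustrates that idea under assumed embedding inputs; the `CompositionRewardModel` architecture and its dimensions are placeholders, not the paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompositionRewardModel(nn.Module):
    """Placeholder reward head: scores an (image embedding, prompt embedding) pair
    for one compositional aspect, e.g. attribute binding."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, 1),
        )

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        return self.score(torch.cat([img_emb, txt_emb], dim=-1)).squeeze(-1)

def pairwise_ranking_loss(reward_model: nn.Module,
                          img_win: torch.Tensor,
                          img_lose: torch.Tensor,
                          txt: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry objective: the higher-ranked image should score higher."""
    r_win = reward_model(img_win, txt)
    r_lose = reward_model(img_lose, txt)
    return -F.logsigmoid(r_win - r_lose).mean()
```

One such reward model can be trained per compositional aspect, so each provides targeted feedback during fine-tuning.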

Finally, the team introduced iterative feedback learning, a closed-loop process in which the base diffusion model and the reward models are refined together over multiple rounds. This progressive self-refinement drives continuous improvement in compositional generation.
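At a high level, the loop alternates between improving the generator with the current reward models and refreshing the reward models with the generator's newest outputs. The sketch below captures that structure only; the callables it accepts (fine_tune, generate, rank_samples, update_rewards) are hypothetical stand-ins for the paper's actual training procedures.

```python
from typing import Callable, List

def iterative_feedback_learning(base_model,
                                reward_models,
                                prompts: List[str],
                                fine_tune: Callable,       # reward-guided fine-tuning of the diffusion model
                                generate: Callable,        # sample images from the current base model
                                rank_samples: Callable,    # collect new composition-aware rankings
                                update_rewards: Callable,  # re-train reward models on the expanded data
                                num_iterations: int = 3):
    """Closed-loop refinement: each round the generator learns from the reward
    models, then the reward models learn from the generator's new samples."""
    for _ in range(num_iterations):
        base_model = fine_tune(base_model, reward_models, prompts)
        samples = generate(base_model, prompts)
        new_pairs = rank_samples(samples)
        reward_models = update_rewards(reward_models, new_pairs)
    return base_model, reward_models
```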

The researchers conducted extensive experiments comparing IterComp with state-of-the-art methods such as FLUX, SDXL, and RPG. The results show that IterComp consistently outperforms these baselines in both compositional generation and image realism, while also delivering significant speedups in inference time.

Furthermore, the researchers ran a user study in which participants consistently preferred images generated by IterComp over those from competing methods, confirming its ability to produce images that faithfully reflect the intended composition.

IterComp represents a significant advance in text-to-image generation, paving the way for diffusion models that can reliably produce high-quality, compositionally accurate images. The framework also opens promising research directions in reward feedback learning for diffusion models and in compositional generation.