Faithful and Realistic Text-to-Image Generation with Adaptive Prompt Weighting

🔊

💬 Ask

Text-to-image (T2I) diffusion models are quite good at generating images from text prompts, but ensuring that the generated image faithfully aligns with the prompt’s semantics remains a challenge. Recent works have attempted to improve faithfulness by optimizing the latent code, but this can sometimes lead to unrealistic images.

A new paper titled “FRAP: Faithful and Realistic Text-to-Image Generation with Adaptive Prompt Weighting” proposes a simple but effective approach that adaptively adjusts the per-token prompt weights throughout the image generation process to improve prompt-image alignment and image quality. This method doesn’t rely on changing the latent code, which avoids the risk of generating unrealistic images.

FRAP works by minimizing a unified objective function that encourages the presence of objects mentioned in the prompt and the correct binding of object-modifier pairs. For example, a prompt like “a brown dog chasing a white cat” would need to ensure that the dog is brown, the cat is white, and that the dog is actually chasing the cat.

To adaptively adjust the prompt weights, FRAP uses an online algorithm that updates each token’s weight coefficient at each step of the reverse generative process (RGP). This online optimization is performed during inference, which means it doesn’t require any additional training.

The paper demonstrates that FRAP generates images with significantly higher prompt-image alignment compared to recent methods, especially on complex datasets with challenging prompts.

For example, when tested on the COCO-Subject dataset, FRAP achieved significantly higher faithfulness compared to the baseline methods:

FRAP: 0.837
D&B: 0.835
A&E: 0.829

FRAP was also found to be more efficient than other methods, with an average latency of 14.0 seconds on the COCO-Subject dataset, compared to 18.1 seconds for D&B and 19.1 seconds for A&E.

Furthermore, FRAP can be easily integrated with prompt rewriting LLMs to recover degraded prompt-image alignment.

The paper concludes that FRAP is a simple yet effective approach to improving the faithfulness and authenticity of T2I diffusion models. This method provides a significant improvement over existing techniques and offers a promising direction for future research in this area.

AI Papers Reader

Personalized digests of latest AI research

Faithful and Realistic Text-to-Image Generation with Adaptive Prompt Weighting

Chat about this paper