
Generative Photomontage: Composing Images from Multiple Generated Results

Text-to-image models have revolutionized image creation, allowing users to generate images from simple text prompts or sketches. However, getting a single image that captures everything the user wants can be difficult, much like hoping a single roll of the dice comes up exactly right.

This week, researchers at Carnegie Mellon University and Reichman University introduce “Generative Photomontage,” a framework that allows users to create their desired image by combining different parts of multiple generated images. Imagine you want to generate an image of a robot from the future, but you want it to have a specific hand from one generated image, a specific head from another, and a certain background from a third. Generative Photomontage empowers you to do just that.

The framework starts with a stack of images generated from the same input conditions, such as a text prompt and a sketch, but with different random seeds; in effect, the stack is a collection of dice rolls. The user then selects desired parts from each image with a brush-stroke interface, and the system resolves those strokes into coherent regions using a graph-based optimization in diffusion feature space, a representation that captures semantic information beyond raw pixel colors.
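To make the stack concrete, the sketch below generates several candidates from the same prompt with different seeds using Hugging Face diffusers. It is only an illustration: the model ID, prompt, and seed list are placeholders, and the sketch conditioning described in the paper (e.g., via ControlNet) is omitted for brevity.

```python
# Minimal sketch of building the image "stack": same prompt, different seeds.
# Model, prompt, and seeds are illustrative; sketch/ControlNet conditioning omitted.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a futuristic robot standing in a neon-lit city"
seeds = [0, 1, 2, 3, 4]  # each seed is one "dice roll"

stack = []
for seed in seeds:
    generator = torch.Generator(device="cuda").manual_seed(seed)
    image = pipe(prompt, generator=generator, num_inference_steps=30).images[0]
    stack.append(image)  # candidates the user will later pick regions from
```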

A novel feature-space blending method then composites the selected regions, preserving each region's original realism and fidelity while producing a seamless, harmonious result. This gives users fine-grained control over the final image: they can create new appearance combinations, correct shapes and artifacts, and improve prompt alignment.
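To give a flavor of what blending in feature space means, here is a schematic sketch rather than the authors' implementation: per-image diffusion features are combined under the user's region masks, and the blended features are handed back to the denoiser to render the composite. The function name, tensor shapes, and the assumption that the masks partition the canvas are all illustrative.

```python
# Schematic sketch of feature-space blending: combine per-image diffusion
# features under region masks, then let the denoiser render the composite.
# All names and shapes here are illustrative, not the paper's actual API.
import torch

def blend_features(feature_maps: list[torch.Tensor],
                   masks: list[torch.Tensor]) -> torch.Tensor:
    """feature_maps[i]: (C, H, W) features for generated image i.
    masks[i]: (H, W) binary mask marking the region selected from image i.
    Assumes the masks partition the canvas (exactly one is 1 at each location)."""
    blended = torch.zeros_like(feature_maps[0])
    for feats, mask in zip(feature_maps, masks):
        blended = blended + feats * mask.unsqueeze(0)  # keep image i's features inside its region
    return blended  # injected into the denoising pass in place of a single image's features
```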

For instance, to correct the shape of a generated waffle, the user can select a more desirable background region from a different image in the stack and blend it into the original. Likewise, Generative Photomontage lets a user combine the body, ear, and arm of robots from different images into a single new robot.

The researchers demonstrate that their method outperforms existing image-blending approaches, such as gradient-domain (Poisson) blending and noise-based blending, at preserving local appearance while blending regions harmoniously. They also show that it is more robust to imprecise user selections than alternatives such as a modified version of Segment Anything (SAM), a popular segmentation model.
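For context on the gradient-domain baseline, Poisson-style pixel-space compositing can be approximated with OpenCV's seamlessClone, as in the rough sketch below; the file names and mask are made up, and this stands in for the kind of baseline being compared against rather than the paper's actual experimental setup.

```python
# Rough pixel-space baseline: gradient-domain (Poisson) blending of a selected
# region into a target image via OpenCV's seamlessClone.
# File names and the mask are illustrative; this is not the paper's method.
import cv2
import numpy as np

target = cv2.imread("generated_a.png")          # image whose overall scene we keep
source = cv2.imread("generated_b.png")          # image containing the desired part
mask = cv2.imread("selection_mask.png", cv2.IMREAD_GRAYSCALE)  # white where the part was selected
mask = (mask > 127).astype(np.uint8) * 255

# Place the cloned region back at its original location (bounding-box center).
ys, xs = np.nonzero(mask)
center = (int((xs.min() + xs.max()) // 2), int((ys.min() + ys.max()) // 2))

composite = cv2.seamlessClone(source, target, mask, center, cv2.NORMAL_CLONE)
cv2.imwrite("poisson_composite.png", composite)
```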

Overall, Generative Photomontage offers a powerful and flexible framework for creating desired images by combining multiple generated results, pushing the boundaries of user control in text-to-image models. This approach has the potential to transform how we interact with AI-powered image generation tools, allowing us to achieve more creative and nuanced results with greater ease.