AI Papers Reader

Personalized digests of latest AI research

View on GitHub

PixWizard: A Versatile Image-to-Image Visual Assistant with Open-Language Instructions

Scientists at the Shanghai AI Laboratory and CUHK MMLab have built a new AI system that can manipulate images based on free-form language instructions. Called PixWizard, the system is capable of handling tasks like image restoration, image generation, image editing, and image grounding. To do so, it learns from a massive dataset of 30 million image-text pairs, enabling it to understand instructions related to a wide variety of tasks.

For example, you can give PixWizard the instruction, “Please estimate the depth, Wear a colorful cape, Dress in Iron Man armor, Transform into Van Gogh style, Extract the binary mask of the pink lizard, I Change to green color.” PixWizard will then transform the image accordingly.

A key innovation of PixWizard is its use of Diffusion Transformers (DiTs), which allow it to dynamically process images based on the aspect ratio of the input. PixWizard is also able to incorporate structure-aware and semantic-aware guidance, allowing it to effectively fuse information from the input image. This allows PixWizard to effectively follow multimodal instructions, including images and open-ended text prompts.

Experiments show that PixWizard exhibits promising generalization capabilities, meaning that it can handle tasks and instructions that it was not explicitly trained on. For example, when asked to “Remove the noise and grain from this image,” PixWizard can successfully denoise the image. Similarly, it can remove rain, deblur, or even change the color of an image based on user instructions. This makes PixWizard a powerful tool for creative applications, as it can be used to generate new images or edit existing images in ways that were not previously possible.

Researchers believe that PixWizard is a significant step forward in the development of AI systems that can understand and respond to human language. The technology has the potential to revolutionize a wide range of applications, including image editing, design, and entertainment.

The code and related resources are available at: https://github.com/AFeng-x/PixWizard