
CLIP Is Still Blind: A Diffusion Model Makes It See Better

Contrastive Language-Image Pre-training (CLIP) is a popular method for training AI models to understand both images and text. CLIP has proven to be a versatile tool for a variety of tasks, including image classification, text-to-image retrieval, and even image generation. However, a recent study posted on arXiv by Wenxuan Wang, Quan Sun, Fan Zhang, and colleagues reveals that CLIP has some serious blind spots.
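To make the setup concrete, here is a minimal sketch of how CLIP is typically used for zero-shot classification: the model scores an image against a set of candidate captions, and the best-scoring caption wins. The sketch uses a public CLIP checkpoint through the Hugging Face transformers library; the image path and caption list are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # placeholder path to any image
captions = ["a photo of a cat", "a photo of a dog", "a photo of a school bus"]

# CLIP embeds the image and each caption into a shared space and scores every pair.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits_per_image = model(**inputs).logits_per_image  # shape: [1, num_captions]

for caption, prob in zip(captions, logits_per_image.softmax(dim=-1)[0].tolist()):
    print(f"{prob:.3f}  {caption}")
```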

The researchers found that CLIP has trouble recognizing fine-grained visual details, such as the orientation of an object, the number of objects in an image, or even the color of an object. This shortcoming limits the usefulness of CLIP for many tasks, especially those that require a deep understanding of visual information.

The researchers propose a novel approach to addressing CLIP's visual shortcomings using diffusion models. Diffusion models are a type of generative AI model that has gained popularity for its ability to produce realistic images. The key idea is to use a pre-trained diffusion model as a source of generative feedback: CLIP's visual features are fed to the diffusion model as a condition while it tries to reconstruct a noised version of the original image, and the reconstruction error is used to fine-tune CLIP's visual encoder. CLIP can only keep that error small if its features capture fine-grained visual details, so the process gradually teaches it to encode them.
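The paper has the full training recipe, but the core of this generative-feedback loop can be sketched roughly as below. This is an illustrative schematic rather than the authors' implementation: the module names (clip_vision_encoder, project_to_condition, frozen_diffusion_unet, noise_scheduler) and their interfaces are assumptions, and details such as operating in latent space and the exact conditioning design are omitted.

```python
import torch
import torch.nn.functional as F

def diffusion_feedback_step(images, clip_vision_encoder, project_to_condition,
                            frozen_diffusion_unet, noise_scheduler, optimizer):
    """One schematic optimization step: only the CLIP vision encoder is updated."""
    # 1) Encode the clean images with CLIP's vision tower (the trainable part).
    visual_features = clip_vision_encoder(images)          # e.g. [batch, tokens, dim]
    condition = project_to_condition(visual_features)      # map into the diffusion model's conditioning space

    # 2) Add noise to the images at random diffusion timesteps, as in standard diffusion training.
    noise = torch.randn_like(images)
    timesteps = torch.randint(0, noise_scheduler.num_train_timesteps,
                              (images.shape[0],), device=images.device)
    noisy_images = noise_scheduler.add_noise(images, noise, timesteps)

    # 3) The frozen diffusion model predicts the noise, conditioned on CLIP's features.
    predicted_noise = frozen_diffusion_unet(noisy_images, timesteps, condition)

    # 4) The denoising error is small only when the CLIP features carry enough
    #    fine-grained detail to reconstruct the image, so minimizing it
    #    pushes that detail into CLIP's representation.
    loss = F.mse_loss(predicted_noise, noise)
    optimizer.zero_grad()
    loss.backward()   # gradients reach clip_vision_encoder; the diffusion model stays frozen
    optimizer.step()
    return loss.item()
```

A notable design choice is that the diffusion model itself stays frozen throughout, so it acts purely as an assistant providing a training signal for CLIP rather than being retrained.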

The researchers call their new method DIVA, which stands for Diffusion model as a Visual Assistant for CLIP. The diffusion model that provides the feedback is pre-trained on image-text data, but fine-tuning CLIP with DIVA requires only images, with no paired text descriptions. The researchers found that DIVA greatly improved CLIP's ability to recognize fine-grained visual details.

For example, the researchers tested CLIP's ability to distinguish between two images of a school bus: one parked facing the camera and one parked facing away from it. Before fine-tuning with DIVA, CLIP could not reliably tell the two apart; afterwards, it correctly identified the bus's orientation in both images.
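A probe of this kind can be written with the same CLIP interface as the earlier sketch: score each bus photo against two captions that differ only in the bus's orientation and check whether the matching caption wins. The file names are placeholders, and a model that is "blind" to orientation will assign nearly identical scores to both captions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder file names for the two test photos.
images = [Image.open("bus_facing_camera.jpg"), Image.open("bus_facing_away.jpg")]
captions = ["a school bus facing the camera",
            "a school bus facing away from the camera"]

inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    # Rows are images, columns are captions; each row should peak on the matching caption.
    probs = model(**inputs).logits_per_image.softmax(dim=-1)

print(probs)
```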

The researchers also found that DIVA improved CLIP's performance on a variety of other tasks, including multimodal understanding and image classification, and that it boosted multimodal large language models (MLLMs) that use CLIP as their visual encoder.

DIVA is a simple and effective method for improving the visual perception capabilities of CLIP models, and it could pave the way for more powerful and accurate AI systems built on top of CLIP.

The researchers have made the code for DIVA available on GitHub, so others can build on their work. The study is a welcome contribution, and it is likely to inspire further research into sharpening the visual perception of vision-language models.