AI Papers Reader

Personalized digests of the latest AI research

Vision-SR1 Tackles Visual Hallucinations with Self-Rewarding AI

Vision-Language Models (VLMs) have made impressive strides in understanding and responding to visual information combined with text. However, a persistent challenge is their tendency towards “visual hallucinations” – describing things not present in an image – and “language shortcuts,” where they rely more on pre-existing text knowledge than on the actual visual content. This paper introduces Vision-SR1, a novel self-rewarding framework designed to improve VLMs’ visual reasoning abilities without the need for costly external human annotations or pre-defined supervision signals.

The core innovation of Vision-SR1 lies in its method of decomposing the VLM’s reasoning process into two distinct stages: visual perception and language reasoning. During training, the VLM first generates a detailed, self-contained “visual perception” of the input image. This perception is intended to be comprehensive enough to answer the question without needing to refer back to the original image.
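To make the two-stage decomposition concrete, here is a minimal sketch in Python. The `vlm.generate` interface and the prompt wording are illustrative assumptions for this summary, not the paper's exact implementation:

```python
# Sketch of Vision-SR1's two-stage decomposition. The `vlm.generate`
# interface and the prompt text are hypothetical stand-ins for whatever
# VLM API and prompts the paper actually uses.

PERCEPTION_PROMPT = (
    "Describe the image in enough self-contained detail that the question "
    "below could be answered from your description alone.\nQuestion: {q}"
)

def two_stage_inference(vlm, image, question):
    # Stage 1: visual perception -- a self-contained textual description
    # of the image, conditioned on the question.
    perception = vlm.generate(
        image=image,
        text=PERCEPTION_PROMPT.format(q=question),
    )
    # Stage 2: language reasoning -- produce the final answer, here with
    # both the image and the generated perception available as context.
    answer = vlm.generate(
        image=image,
        text=f"Perception: {perception}\nQuestion: {question}\nAnswer:",
    )
    return perception, answer
```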

To ensure the quality of this visual perception, Vision-SR1 employs a clever “self-reward” mechanism. The same VLM is then prompted to answer the question using only its generated visual perception as input. If the VLM can correctly answer the question based solely on its self-generated description, it receives a positive reward, reinforcing the accuracy and completeness of its visual perception. This self-reward signal is then combined with a standard reward for the final answer’s correctness, creating a balanced training signal that guides the model to improve both its visual understanding and its reasoning capabilities.
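A rough sketch of how such a combined reward might be computed is shown below, continuing the hypothetical `vlm.generate` interface from above. The exact-match answer check and the equal weighting between the two reward terms are simplifying assumptions for illustration, not the paper's exact formulation:

```python
def answers_match(pred: str, gold: str) -> bool:
    # Simplistic normalization + exact match; a real benchmark would use
    # its own answer-matching rules.
    return pred.strip().lower() == gold.strip().lower()

def compute_reward(vlm, question, gold_answer, perception, final_answer,
                   alpha=0.5):  # equal weighting is an assumption
    # Outcome reward: is the final answer (produced with the image) correct?
    r_answer = 1.0 if answers_match(final_answer, gold_answer) else 0.0

    # Self-reward: re-answer the question from the perception text ONLY,
    # with no image attached. A correct answer indicates the perception was
    # accurate and complete enough to stand on its own.
    text_only_answer = vlm.generate(
        image=None,
        text=f"Perception: {perception}\nQuestion: {question}\nAnswer:",
    )
    r_perception = 1.0 if answers_match(text_only_answer, gold_answer) else 0.0

    # Balanced signal used to reinforce the policy during RL training.
    return alpha * r_perception + (1.0 - alpha) * r_answer
```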

For instance, consider an image of several nested Matryoshka dolls and the question, “How many Matryoshka dolls are here?” A typical VLM might hallucinate a count or fall back on its prior knowledge of how many dolls a standard Russian nesting set contains. Vision-SR1, by contrast, would first prompt the model to describe the scene, perhaps noting, “There appear to be two distinct sets of Matryoshka dolls, with a total of seven visible dolls, including the smallest nested ones.” If, when given only this description and the question, the VLM correctly answers “7,” it receives a self-reward. This process encourages the model to ground its answer in the visual details rather than in assumptions.

The researchers demonstrated Vision-SR1’s effectiveness across various vision-language tasks, showing significant improvements in visual reasoning and a notable reduction in visual hallucinations and language shortcuts. The paper highlights that Vision-SR1’s approach strengthens the crucial link between visual perception and linguistic output, leading to more reliable and accurate responses. This self-rewarding strategy represents a promising direction for developing more robust and trustworthy AI systems that truly “see” and understand the world.