Vision Language Models Show Surprising Shape Bias, and It Can Be Controlled With Language
Vision language models (VLMs), which combine the power of computer vision and natural language processing, are transforming the field of artificial intelligence. A new paper, “Are Vision Language Models Texture or Shape Biased and Can We Steer Them?”, published on arXiv, investigates a fundamental aspect of these models: their bias towards recognizing objects based on texture versus shape. The researchers’ findings reveal a surprising dominance of shape bias in VLMs and, importantly, show that this bias can be steered using language prompts alone.
The Texture vs. Shape Bias:
Humans excel at object recognition, relying primarily on an object’s overall shape. Earlier studies have shown, however, that traditional computer vision models often exhibit a “texture bias,” favoring local textural details over global shape when identifying objects. For example, a vision model might classify an image as “bottle” because its surface carries a bottle-like texture, even if the overall shape strongly suggests an elephant.
VLMs Favor Shape:
The researchers evaluated a wide range of popular VLMs, including models like GPT-4V, LLaVA, and InternVL-Chat. Surprisingly, they found that most VLMs demonstrated a stronger preference for shape over texture than their underlying vision encoders alone. This suggests that integrating language into the model somehow modulates the inherent visual bias.
Concrete Example: Imagine showing a VLM an image of an elephant with a bottle-like texture applied to its surface. A purely texture-biased model might classify it as a bottle. The studied VLMs, however, tend to identify the object as an elephant, indicating a shift towards the human-like shape bias.
Steering the Bias with Language:
The most significant finding is the researchers’ demonstration that they could steer the shape bias of VLMs using carefully crafted language prompts. For example, by explicitly prompting the VLM to focus on the “shape” of an object, they could significantly increase the model’s reliance on shape for classification. Conversely, prompting the model to focus on “texture” could shift the bias in the opposite direction.
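As a rough illustration of what such prompt steering can look like in practice, here is a minimal Python sketch that sends the same cue-conflict image to a VLM with a neutral, a shape-steering, and a texture-steering prompt. It uses the OpenAI chat API purely as an example backend; the exact prompt wording, model name, label set, and file path are illustrative assumptions, not the paper’s protocol.

import base64
from openai import OpenAI  # example backend only; any VLM API could be substituted

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
CLASSES = ["elephant", "bottle", "cat", "clock"]  # illustrative subset of cue-conflict classes

NEUTRAL = f"Which object is shown in this image? Answer with one word from: {', '.join(CLASSES)}."
SHAPE_STEERED = "Focus only on the overall shape of the object and ignore its texture. " + NEUTRAL
TEXTURE_STEERED = "Focus only on the texture of the object and ignore its shape. " + NEUTRAL

def query_vlm(image_path: str, prompt: str) -> str:
    """Send one image plus a prompt to the VLM and return its (lowercased) text answer."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content.strip().lower()

# Hypothetical cue-conflict image: elephant shape rendered with bottle texture.
image = "cue_conflict/elephant_shape_bottle_texture.png"
for name, prompt in [("neutral", NEUTRAL), ("shape-steered", SHAPE_STEERED), ("texture-steered", TEXTURE_STEERED)]:
    print(name, "->", query_vlm(image, prompt))

In the actual evaluation the answers would be mapped back onto the fixed label set and aggregated over many cue-conflict images; the snippet only shows how a single query is steered.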
Quantifiable Results: The researchers quantitatively measured shape bias, showing that they could move a given model from as low as 49% shape bias to as high as 72% through prompt manipulation alone. While impressive, this range still falls short of the roughly 96% shape bias that humans exhibit.
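For context on how such a percentage is computed: shape-bias figures in this line of work generally follow the standard cue-conflict definition, i.e. among all answers that name either the image’s shape class or its texture class, the fraction that name the shape class. A minimal sketch, assuming the per-image predictions have already been collected:

def shape_bias(decisions):
    """decisions: iterable of (predicted, shape_label, texture_label) triples,
    one per cue-conflict image. Returns shape decisions / (shape + texture decisions),
    ignoring answers that match neither cue."""
    shape_hits = sum(pred == shape for pred, shape, _ in decisions)
    texture_hits = sum(pred == texture for pred, _, texture in decisions)
    return shape_hits / (shape_hits + texture_hits)

# Toy example: two shape decisions and one texture decision -> ~0.67 shape bias.
example = [
    ("elephant", "elephant", "bottle"),
    ("cat", "cat", "clock"),
    ("bottle", "elephant", "bottle"),
]
print(f"shape bias: {shape_bias(example):.2f}")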
Mechanism and Implications: The paper also explores potential mechanisms behind this language-induced modulation. Examining the role of the vision encoder (the part of the model responsible for processing the image), the authors find that while VLMs are clearly influenced by their encoders, multimodal fusion with the language model allows a degree of independence from the encoder’s bias. This points to future possibilities for aligning visual AI models more closely with human perception. The ability to steer bias through language opens exciting possibilities for controlling model behavior and potentially mitigating undesirable biases in AI systems. Furthermore, this research highlights the importance of considering multimodal interactions when analyzing and improving visual AI.