AI’s “Yes-Man” Problem: Why Your Vision Model Sees What Isn’t There
Imagine showing an AI a photo of a white cat resting on a blue wooden chair. If you ask, “Is there a wolf in this image?” the model will easily say no. But what if you ask, “Can you see the white cat sitting on the purple and orange chair?”
To a human, the mistake is obvious: the cat is there, but the chair color is wrong. To a cutting-edge Multimodal Large Language Model (MLLM), however, the answer is often a confident, and incorrect, “Yes.”
This phenomenon is the focus of a new paper by researchers from the Technical University of Munich and several leading AI labs. They have introduced FINER (FIne-grained NEgative queRies), a framework designed to expose a critical weakness in AI vision: the tendency to hallucinate when faced with subtle, highly detailed “negative” questions.
The Devil in the Details
Current AI models are surprisingly good at recognizing objects in a “coarse” sense. If you ask about a missing object, they usually catch it. However, the researchers discovered that as questions become more “fine-grained”—incorporating specific attributes or complex relationships—the models’ accuracy plummets.
The team developed a seven-level stress test. At Level 1, they might ask if a non-existent wolf is in a cat photo. Most models pass. By Level 7, they ask about a specific cat with a specific head tilt sitting on a specific piece of furniture, where only one tiny detail (like the furniture color) is wrong.
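The core trick behind these fine-grained negative queries is to take a fully correct scene description and corrupt exactly one attribute, as in the chair-color example above. A minimal sketch of that idea (the function and scene fields here are illustrative, not the paper’s actual pipeline):

```python
def make_negative_query(scene: dict, attribute: str, wrong_value: str) -> str:
    """Render a yes/no question in which one scene attribute is deliberately wrong."""
    attrs = {**scene, attribute: wrong_value}  # copy the scene, swap one detail
    return (f"Can you see the {attrs['color']} {attrs['object']} "
            f"{attrs['pose']} on the {attrs['surface_color']} {attrs['surface']}?")

# Ground-truth scene: a white cat sitting on a blue chair.
scene = {"object": "cat", "color": "white", "pose": "sitting",
         "surface": "chair", "surface_color": "blue"}

# Corrupt only the furniture color; everything else stays true to the image.
print(make_negative_query(scene, "surface_color", "purple and orange"))
# → Can you see the white cat sitting on the purple and orange chair?
```

The harder levels in the paper’s test correspond, roughly, to richer scene descriptions: the more true details surround the single false one, the harder the error is for the model to spot.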
The results were startling. For one top-tier model, InternVL3.5-14B, accuracy dropped from nearly 80% on simple questions to just 15% on the most detailed ones. The models effectively become “yes-men.” Because most of the query (the cat, the sitting pose) matches the image, the AI ignores the single contradictory detail and confirms the entire statement.
Learning to Say “No”
To fix this, the researchers introduced FINER-Tuning. They realized that simply telling a model “don’t hallucinate” isn’t enough. Instead, they used a technique called Direct Preference Optimization (DPO).
Think of this as a high-speed game of “spot the difference.” The researchers created a massive dataset of 160,000 “preference tuples.” For every image, they presented the model with a correct description and a “near-miss” description—for instance, a “car with a chrome bumper” versus a “car with a yellow bumper.” By training the model to prefer the response that correctly identifies the subtle error, they taught the AI to become a more skeptical observer.
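At the heart of DPO is a simple loss over each preference pair: push the trained model to favor the correct caption over the near-miss more strongly than a frozen reference model does. A minimal sketch for a single pair (the log-probability values below are made up for illustration; the real method operates on full MLLM outputs):

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the total log-probability a model assigns to a caption
    given the image: 'chosen' is the correct description, 'rejected' the
    near-miss with a single wrong detail. 'ref' is the frozen reference model.
    """
    margin = (policy_logp_chosen - ref_logp_chosen) \
           - (policy_logp_rejected - ref_logp_rejected)
    # -log(sigmoid(beta * margin)): shrinks as the policy prefers the
    # correct caption more strongly than the reference model does.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Hypothetical numbers: the policy already leans toward the correct caption
# ("car with a chrome bumper") over the near-miss ("car with a yellow bumper"),
# so the loss is below log(2), the value at indifference.
loss = dpo_loss(policy_logp_chosen=-12.0, policy_logp_rejected=-15.0,
                ref_logp_chosen=-13.0, ref_logp_rejected=-13.5)
print(loss)
```

Because the rejected caption differs from the chosen one in only one attribute, minimizing this loss over 160,000 such pairs forces the model’s probability mass onto exactly the fine-grained details it used to gloss over.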
Why It Matters
This isn’t just about misidentifying furniture. The researchers point out that as AI is integrated into high-stakes fields like medical imaging or autonomous driving, the ability to spot a “fine-grained” error is a matter of safety. A model that says “Yes” to an incorrectly described surgical scan is a liability.
The results of FINER-Tuning are promising. Models updated with this method saw accuracy gains of up to 24.2% on the team’s benchmarks. More impressively, these models also improved on eight existing general hallucination tests without losing their ability to answer standard questions.
By teaching AI that “No” is sometimes the smartest answer, the FINER project is moving us closer to vision models that don’t just see the world, but actually understand the nuances within it.