MLLMs Can See? Dynamic Correction Decoding for Hallucination Mitigation
Multimodal Large Language Models (MLLMs) are revolutionizing the field of artificial intelligence, but their ability to accurately describe visual scenes is hampered by a phenomenon called hallucination. This means that MLLMs sometimes invent details in their descriptions that are not present in the image.
A new paper by researchers at Zhejiang University and the National University of Singapore shows that MLLMs can in fact recognize visual objects in their earlier layers, but that this knowledge can be suppressed by the model's language priors by the time it reaches the output. The researchers propose a method called Dynamic Correction Decoding with Preceding-Layer Knowledge (DeCo) that mitigates hallucinations by integrating visual information from these earlier layers into the final output.
“Hallucinations are a real problem for MLLMs, but our research shows that they’re not blind,” says Chenxi Wang, lead author of the paper. “We found that MLLMs can actually see visual objects, but this information is sometimes suppressed by the model’s language priors.”
To illustrate this, the researchers probed the internal representations of a popular MLLM, LLaVA. They found that the model assigned high confidence to the correct objects in its earlier layers, but that this confidence declined as information flowed toward the final layers, eventually giving way to hallucinated tokens in the output.
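A minimal sketch of this kind of layer-wise probing is shown below. It uses a logit-lens-style projection of each layer's hidden state through the model's unembedding head to track how much probability a ground-truth object token receives at every depth. The function name, arguments, and the assumption that the model exposes Hugging Face-style `hidden_states` are illustrative, not the authors' exact protocol.

```python
import torch

def layerwise_object_confidence(hidden_states, final_norm, lm_head, token_id, position):
    """Logit-lens-style probe: for each layer, project the hidden state at
    `position` through the model's final norm and unembedding head, and record
    the probability assigned to `token_id` (e.g. the ground-truth object word).
    `hidden_states` is the tuple returned by a Hugging Face-style model called
    with output_hidden_states=True."""
    confidences = []
    for h in hidden_states[1:]:                          # index 0 is the embedding output
        logits = lm_head(final_norm(h[:, position, :]))  # (batch, vocab_size)
        probs = torch.softmax(logits.float(), dim=-1)
        confidences.append(probs[0, token_id].item())
    return confidences
```

Plotting these per-layer confidences is what reveals the pattern the authors describe: the true object peaks in intermediate layers, then fades before the final prediction.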
To address this issue, DeCo dynamically selects a preceding layer in which the true objects still receive high confidence, and uses that layer's predictions to adjust the final output logits. “DeCo is like giving the MLLM a second chance to see what it’s missing,” explains Wang.
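The sketch below illustrates one decoding step under this idea, based on a plain reading of the method: project a window of preceding layers through the LM head, pick the layer that is most confident about the final layer's top candidate tokens, and blend its logits into the final-layer logits with a weight alpha. The layer window, candidate count, and blending rule here are illustrative assumptions, not the authors' exact formulation.

```python
import torch

def deco_step(layer_logits, final_logits, candidate_k=10, alpha=0.5, layer_window=(20, 28)):
    """One next-token decoding step with preceding-layer correction (simplified).

    layer_logits: (num_layers, vocab_size) next-token logits obtained by projecting
                  each layer's hidden state through the LM head.
    final_logits: (vocab_size,) next-token logits from the last layer.
    Returns corrected logits to use for greedy decoding or sampling.
    """
    # Candidate tokens the model is already considering at the final layer.
    candidates = torch.topk(final_logits, candidate_k).indices

    # Within a window of preceding layers, find the layer that is most
    # confident about any of those candidate tokens.
    lo, hi = layer_window
    window = layer_logits[lo:hi]                                       # (hi - lo, vocab_size)
    window_probs = torch.softmax(window.float(), dim=-1)
    candidate_conf = window_probs[:, candidates].max(dim=-1).values    # (hi - lo,)
    anchor_logits = window[candidate_conf.argmax()]                    # (vocab_size,)

    # Blend the anchor layer's evidence into the final-layer prediction.
    return (1.0 - alpha) * final_logits + alpha * anchor_logits
```

Because the correction operates purely on logits at decoding time, it can sit on top of an existing generation loop without touching the model's weights.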
The researchers tested DeCo on several hallucination benchmarks, including CHAIR and POPE, and found that it significantly reduced hallucination rates across multiple MLLMs. For example, under the CHAIR metric, DeCo achieved an average hallucination suppression rate of 10.8%.
DeCo is model-agnostic: it requires no retraining and can be combined with standard decoding strategies across different MLLMs. The researchers also report that it adds only modest computational overhead at decoding time.
This research sheds new light on the inner workings of MLLMs and demonstrates that these models possess a surprising degree of visual awareness. DeCo offers a practical solution for mitigating hallucinations and will likely play a crucial role in the development of more reliable and robust multimodal systems.