ConVis: Contrastive Decoding with Hallucination Visualization for Mitigating Hallucinations in Multimodal Large Language Models
解码方法
幻觉
计算机科学
心理学
可视化
语言学
认知心理学
人工智能
哲学
算法
精神科
作者
Yeji Park,Deokyeong Lee,Junsuk Choe,Buru Chang
出处
期刊:Proceedings of the ... AAAI Conference on Artificial Intelligence [Association for the Advancement of Artificial Intelligence (AAAI)] 日期:2025-04-11卷期号:39 (6): 6434-6442
标识
DOI:10.1609/aaai.v39i6.32689
摘要
Hallucinations in Multimodal Large Language Models (MLLMs) where generated responses fail to accurately reflect the given image pose a significant challenge to their reliability. To address this, we introduce ConVis, a novel training-free contrastive decoding method. ConVis leverages a text-to-image (T2I) generation model to semantically reconstruct the given image from hallucinated captions. By comparing the contrasting probability distributions produced by the original and reconstructed images, ConVis enables MLLMs to capture visual contrastive signals that penalize hallucination generation. Notably, this method operates purely within the decoding process, eliminating the need for additional data or model updates. Our extensive experiments on five popular benchmarks demonstrate that ConVis effectively reduces hallucinations across various MLLMs, highlighting its potential to enhance model reliability.