Computer Science
Audio-Visual
Artificial Intelligence
Computer Vision
Natural Language Processing
Speech Recognition
Human-Computer Interaction
Multimedia
Authors
Qilang Ye, Zitong Yu, Rui Shao, Yawen Cui, Xiangui Kang, Xin Liu, Philip H. S. Torr, Xiaochun Cao
Identifier
DOI: 10.1109/TPAMI.2025.3582389
Abstract
Multimodal Large Language Models (MLLMs) have gained significant attention due to their rich internal implicit knowledge for cross-modal learning. Although advances in bringing audio-visual inputs into LLMs have resulted in boosts for a variety of Audio-Visual Question Answering (AVQA) tasks, they still face two crucial challenges: 1) audio-visual ambiguity, and 2) audio-visual hallucination. Existing MLLMs can respond to audio-visual content, yet sometimes fail to describe specific objects due to ambiguity or hallucination in their responses. To overcome these two issues, we introduce CAT+, which enhances the MLLM to ensure more robust multimodal understanding. We first propose the Sequential Question-guided Module (SQM), which combines tiny transformer layers with cascaded Q-Formers to realize solid audio-visual grounding. After feature alignment and high-quality instruction tuning, we introduce Ambiguity Scoring Direct Preference Optimization (AS-DPO) to correct CAT+'s bias toward ambiguous descriptions. To explore the hallucinatory deficits of MLLMs in dynamic audio-visual scenes, we build a new Audio-visual Hallucination Benchmark, named AVHbench. This benchmark measures the extent of MLLM hallucinations across three protocols: perceptual object, counting, and holistic description tasks. Extensive experiments across video-based understanding, open-ended, and close-ended AVQA demonstrate the superior performance of our method. The AVHbench is released at https://github.com/rikeilong/Bay-CAT.
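The abstract describes AS-DPO only at a high level. As a rough illustration of how an ambiguity score could be folded into a standard Direct Preference Optimization objective, the minimal sketch below weights each preference pair's DPO loss by a per-pair ambiguity score. The function name `as_dpo_loss`, the multiplicative weighting, and the assumed score range are illustrative assumptions, not the paper's actual formulation.

```python
# Hypothetical sketch of an ambiguity-weighted DPO loss.
# The scoring and weighting scheme are assumptions for illustration only.
import torch
import torch.nn.functional as F

def as_dpo_loss(policy_chosen_logps: torch.Tensor,
                policy_rejected_logps: torch.Tensor,
                ref_chosen_logps: torch.Tensor,
                ref_rejected_logps: torch.Tensor,
                ambiguity_scores: torch.Tensor,
                beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective re-weighted per pair by an ambiguity score.

    ambiguity_scores is assumed to lie in [0, 1]; higher values up-weight pairs
    whose rejected answer was judged ambiguous, pushing the policy harder toward
    the clearer (chosen) answer.
    """
    # Log-ratio of policy vs. reference model for chosen and rejected responses.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # Classic DPO logits: beta * (chosen margin - rejected margin).
    logits = beta * (chosen_logratio - rejected_logratio)

    # Weight each pair's negative log-sigmoid loss by its ambiguity score.
    per_pair_loss = -F.logsigmoid(logits) * ambiguity_scores
    return per_pair_loss.mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
if __name__ == "__main__":
    b = 4
    loss = as_dpo_loss(torch.randn(b), torch.randn(b),
                       torch.randn(b), torch.randn(b),
                       ambiguity_scores=torch.rand(b))
    print(loss.item())
```

With ambiguity_scores fixed to 1, this reduces to the standard DPO loss, so the weighting can be seen as a soft curriculum over preference pairs rather than a change to the underlying objective.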