Authors
Chenyu Lu, Jun Yin, Hao Yang, Shiliang Sun
Identifier
DOI: 10.1016/j.inffus.2024.102302
Abstract
Visual dialog aims to accomplish multiple rounds of dialog by fusing information extracted from images, captions, and previous question-answer pairs. As a vision-language task, visual dialog encounters challenges related to language bias and vision bias. These biases create an imbalance in multi-modal fusion, resulting in shortcut learning and significantly compromising the model's robustness. Moreover, existing multi-modal fusion methods in visual dialog exhibit a low data interaction frequency, leading to insufficient fusion. To overcome the balance and sufficiency issues in multi-modal fusion, we propose a novel Parallel Attention Fusion visual dialog model with Counterfactual Sample debiasing (CS-PAF). Specifically, CS-PAF consists of two core ingredients: (i) a counterfactual sample generation module for model debiasing; and (ii) a parallel attention fusion network that enhances sufficiency in multi-modal data interaction. Notably, in contrast to other debiasing methods, our counterfactual sample generation applies contrastive learning to circumvent the high cost of manual annotations and ensure seamless integration with other models. Extensive comparisons with state-of-the-art approaches, along with comprehensive ablation and transferability studies across multiple datasets, substantiate the superiority and effectiveness of our CS-PAF. Our implementation is available at https://github.com/chenyulu2000/cspaf.
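The abstract describes attending from the dialog question to each modality (image, caption, dialog history) in parallel before fusing. A minimal sketch of that idea is shown below; the function names, the use of plain scaled dot-product attention, and the averaging fusion step are illustrative assumptions, not the paper's actual CS-PAF architecture.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention(query, keys, values):
    # Scaled dot-product attention: weight each value row by
    # the similarity between the query and its key.
    d = len(query)
    scores = [dot(query, k) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

def parallel_attention_fusion(question, modalities):
    # Attend from the question to every modality in parallel,
    # then average the attended vectors (a simple fusion stand-in
    # for the paper's fusion network).
    attended = [attention(question, feats, feats) for feats in modalities]
    d = len(question)
    return [sum(a[i] for a in attended) / len(attended) for i in range(d)]

# Toy 2-d features for image, caption, and dialog history.
question = [1.0, 0.0]
image    = [[1.0, 0.0], [0.0, 1.0]]
caption  = [[0.5, 0.5]]
history  = [[0.0, 1.0]]
fused = parallel_attention_fusion(question, [image, caption, history])
```

Because each modality is attended independently, no single modality's features dominate the interaction step, which is one plausible reading of how parallel fusion addresses the balance issue the abstract raises.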