Computer science
Artificial intelligence
Causal model
Generalization
Modality (human-computer interaction)
Image (mathematics)
Graph
Image retrieval
Semantics (computer science)
Task (project management)
Natural language processing
Pattern recognition (psychology)
Theoretical computer science
Mathematics
Mathematical analysis
Statistics
Management
Economics
Programming language
Authors
Weijia Feng,Dazhen Lin,Donglin Cao
Identifier
DOI:10.1007/978-981-99-8429-9_17
Abstract
Traditional image-to-text retrieval models learn joint representations by aligning multimodal features, but they typically capture only weak correlations between image and text data, which can introduce noise during modality alignment. To address this problem, we propose a Multimodal Causal CLIP (MMC-CLIP) network that integrates causal semantic relationships into CLIP for the image-to-text retrieval task. First, we employ the Multimodal Causal Discovery (MCD) method, which models the causal relationships among causal variables in both image and text data to construct a multimodal causal graph. We then integrate the causal nodes extracted from the multimodal causal graph as learnable prompts within the CLIP model, yielding the novel Multimodal Causal CLIP framework. By injecting causal semantics into CLIP, MMC-CLIP strengthens the correlation between causal variables in the image and text modalities, improving its alignment capability on multimodal image-text data. We demonstrate the effectiveness and generalization of the proposed method, which outperforms all strong baselines on the image-to-text retrieval task on the Flickr30K and MSCOCO datasets.
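The prompt-integration step described in the abstract can be sketched in PyTorch. This is a minimal, hypothetical illustration, not the authors' implementation: the class name `CausalPromptTextEncoder`, the toy encoder sizes, and the stand-in image features are all assumptions. The idea shown is that each causal node from a discovered multimodal causal graph is assigned a learnable prompt vector, prepended to the text token embeddings before a CLIP-style text encoder, and the two modalities are aligned with a symmetric contrastive (InfoNCE) loss as in CLIP.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalPromptTextEncoder(nn.Module):
    """Toy CLIP-style text encoder with learnable causal prompts.

    Each causal node of the multimodal causal graph is mapped to a
    learnable prompt vector prepended to the token embeddings
    (a sketch of the prompt-integration idea, not the paper's code).
    """
    def __init__(self, vocab_size=1000, dim=64, n_causal_nodes=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        # One learnable prompt per causal node (hypothetical design choice).
        self.causal_prompts = nn.Parameter(torch.randn(n_causal_nodes, dim) * 0.02)
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=1)
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_ids):                     # token_ids: (B, L)
        b = token_ids.size(0)
        x = self.tok_emb(token_ids)                   # (B, L, D)
        prompts = self.causal_prompts.unsqueeze(0).expand(b, -1, -1)
        x = torch.cat([prompts, x], dim=1)            # prepend causal prompts
        h = self.encoder(x)                           # (B, n_nodes + L, D)
        return F.normalize(self.proj(h.mean(dim=1)), dim=-1)

def clip_contrastive_loss(img_feat, txt_feat, temperature=0.07):
    """Symmetric InfoNCE loss over matched image/text pairs (CLIP-style)."""
    logits = img_feat @ txt_feat.t() / temperature    # (B, B) similarity matrix
    targets = torch.arange(img_feat.size(0))          # diagonal = positive pairs
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage sketch with random tokens and stand-in image features.
txt_enc = CausalPromptTextEncoder()
tokens = torch.randint(0, 1000, (8, 16))
txt = txt_enc(tokens)                                 # (8, 64), unit-normalized
img = F.normalize(torch.randn(8, 64), dim=-1)         # placeholder image encoder output
loss = clip_contrastive_loss(img, txt)
```

In a full model the placeholder image features would come from the CLIP image encoder, and the prompt vectors could be initialized from the causal nodes' text embeddings rather than at random.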