Topics
Pattern
Modality (human-computer interaction)
Computer science
Representation (politics)
Artificial intelligence
Sentiment analysis
Semantic gap
Natural language processing
Consistency (knowledge bases)
Feature learning
Feature (linguistics)
Multimodality
Semantics (computer science)
Multimodal learning
Linguistics
Image (mathematics)
World Wide Web
Programming language
Social science
Philosophy
Sociology
Politics
Political science
Law
Image retrieval
Authors
Jian Huang, Yanli Ji, Yang Yang, Heng Tao Shen
Identifier
DOI: 10.1145/3581783.3612295
Abstract
Effective alignment and fusion of multimodal features remain a significant challenge for multimodal sentiment analysis. Across multimodal applications, the text modality exhibits a significant advantage: a compact yet expressive representation ability. In this paper, we propose a Cross-modality Representation Interactive Learning (CRIL) approach, which adopts the text modality to guide other modalities in learning representative feature tokens, contributing to effective multimodal fusion in multimodal sentiment analysis. We propose a semantic representation interactive learning module that learns concise semantic representation tokens for the audio and video modalities under the guidance of the text modality, ensuring semantic alignment of representations across modalities. Furthermore, we design a semantic relationship interactive learning module, which calculates a self-attention matrix for each modality and enforces their consistency to achieve semantic relationship alignment across modalities. Finally, we present a two-stage interactive fusion solution to bridge the modality gap for multimodal fusion and sentiment analysis. Extensive experiments on the CMU-MOSEI, CMU-MOSI, and UR-FUNNY datasets demonstrate the effectiveness of the proposed approach.
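The relationship-alignment idea in the abstract can be made concrete. Below is a minimal, hypothetical PyTorch sketch of one plausible reading of the semantic relationship consistency constraint: each modality's token sequence yields a self-attention matrix, and the audio and video matrices are pulled toward the text matrix. The function names, tensor shapes, and the MSE penalty are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the semantic-relationship consistency idea described
# in the abstract. All names, shapes, and the choice of an MSE penalty are
# assumptions for illustration, not the paper's actual code.
import torch
import torch.nn.functional as F

def self_attention_matrix(x: torch.Tensor) -> torch.Tensor:
    """x: (batch, tokens, dim) -> (batch, tokens, tokens) attention map."""
    d = x.size(-1)
    scores = torch.matmul(x, x.transpose(-2, -1)) / d ** 0.5
    return F.softmax(scores, dim=-1)

def relationship_consistency_loss(text, audio, video):
    """Pull the audio/video token-relationship maps toward the text map.
    Assumes all three tensors share a (batch, tokens, dim) shape, e.g. after
    a text-guided token-learning stage has aligned sequence lengths."""
    a_text = self_attention_matrix(text).detach()  # text acts as the guide
    loss = F.mse_loss(self_attention_matrix(audio), a_text)
    loss = loss + F.mse_loss(self_attention_matrix(video), a_text)
    return loss

# Toy usage with random features standing in for per-modality encoder outputs.
b, n, d = 2, 8, 32
text, audio, video = (torch.randn(b, n, d) for _ in range(3))
print(relationship_consistency_loss(text, audio, video))
```

In this reading, detaching the text attention map treats the text modality as a fixed guide, so the gradient only reshapes the audio and video representations; whether CRIL stops the gradient this way is an assumption of the sketch.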