Modality (human-computer interaction)
Computer science
Pattern
Embedding
Artificial intelligence
Missing data
Sentiment analysis
Robustness
Natural language processing
Speech recognition
Machine learning
Authors
Wei Luo, Mengying Xu, Hanjiang Lai
Identifier
DOI:10.1007/978-3-031-27818-1_34
Abstract
Multimodal Sentiment Analysis (MSA) aims to recognize emotion categories from textual, visual, and acoustic cues. In real-life scenarios, however, one or two modalities may be missing for various reasons, and performance degrades sharply when the text modality is absent, since text carries far more semantic information than the vision and audio modalities. To this end, we propose the Multimodal Reconstruct and Align Net (MRAN) to tackle the missing-modality problem, and in particular to mitigate the decline caused by the absence of the text modality. We first propose a Multimodal Embedding and a Missing Index Embedding to guide the reconstruction of the missing modalities' features. Then, visual and acoustic features are projected into the textual feature space, and the features of all three modalities are trained to lie close to the word embedding of their corresponding emotion category, aligning the visual and acoustic features with the textual ones. In this text-centered way, the vision and audio modalities benefit from the more informative text modality, which improves the robustness of the network under different modality-missing conditions, especially when the text modality is missing. Experimental results on two multimodal benchmarks, IEMOCAP and CMU-MOSEI, show that our method outperforms baseline methods, achieving superior results under various kinds of modality-missing conditions.
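As a concrete illustration of the text-centered alignment step described in the abstract, below is a minimal PyTorch sketch. The class name, feature dimensions, projection layers, and the cosine-based alignment loss are all illustrative assumptions, not the authors' actual MRAN implementation; the reconstruction branch guided by the Multimodal Embedding and Missing Index Embedding is not reproduced here.

# Illustrative sketch (assumed design, not the paper's code): project visual
# and acoustic features into the textual feature space and pull all three
# modalities toward the word embedding of the target emotion category.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCenteredAligner(nn.Module):
    def __init__(self, d_text=768, d_visual=512, d_audio=256, num_emotions=6):
        super().__init__()
        self.vis_proj = nn.Linear(d_visual, d_text)   # vision -> text space
        self.aud_proj = nn.Linear(d_audio, d_text)    # audio  -> text space
        # Stand-in for the word embeddings of the emotion-category names.
        self.emotion_embed = nn.Embedding(num_emotions, d_text)

    def forward(self, text_feat, visual_feat, audio_feat, labels):
        v = self.vis_proj(visual_feat)        # (B, d_text)
        a = self.aud_proj(audio_feat)         # (B, d_text)
        anchor = self.emotion_embed(labels)   # (B, d_text)
        # Pull each modality toward the emotion-word anchor so that vision
        # and audio inherit the semantic structure of the text space.
        return sum(1.0 - F.cosine_similarity(m, anchor, dim=-1).mean()
                   for m in (text_feat, v, a))

In a full training loop, such an alignment loss would presumably be combined with the classification loss and the reconstruction loss for the missing modalities.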