Computer science
Pattern
Robustness (evolution)
Fusion
Modality (human-computer interaction)
Audiovisual
Artificial intelligence
Homogeneous
Information fusion
Sensor fusion
Multimedia
Social science
Biochemistry
Chemistry
Linguistics
Philosophy
Physics
Sociology
Gene
Thermodynamics
Authors
Ziwang Fu, Feng Liu, Qing Xu, Jiayin Qi, Xiangling Fu, Aimin Zhou, Zhibin Li
Identifier
DOI:10.1109/icme52920.2022.9859836
Abstract
Fusion technology is crucial for multimodal sentiment analysis. Recent attention-based fusion methods demonstrate high performance and strong robustness. However, these approaches ignore the difference in information density among the three modalities: visual and audio carry low-level signal features, whereas text carries high-level semantic features. To this end, we propose a non-homogeneous fusion network (NHFNet) to achieve multimodal information interaction. Specifically, a fusion module with attention aggregation is designed to fuse the visual and audio modalities and lift them to high-level semantic features. Cross-modal attention is then used to reinforce information between the text modality and the audio-visual fusion. NHFNet compensates for the differences in information density across modalities, enabling their fair interaction. To verify the effectiveness of the proposed method, we conduct experiments on the CMU-MOSEI dataset under both aligned and unaligned settings. The experimental results show that the proposed method outperforms the state of the art. Code is available at https://github.com/skeletonNN/NHFNet.
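The abstract describes a two-stage, non-homogeneous fusion flow: first fuse the low-level visual and audio streams with attention aggregation, then exchange information between the resulting audio-visual representation and the high-level text features via cross-modal attention. The following is a minimal PyTorch sketch of that flow, not the authors' implementation (see the linked repository); the module names, feature dimensions, pooling, and prediction head are illustrative assumptions.

```python
# Minimal sketch of the fusion flow described in the abstract.
# NOT the authors' NHFNet code; dimensions and heads are assumptions.
import torch
import torch.nn as nn


class AudioVisualFusion(nn.Module):
    """Fuse low-level visual and audio features via self-attention
    over the concatenated sequence (attention aggregation)."""

    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual, audio):
        # visual: (B, Tv, D), audio: (B, Ta, D)
        x = torch.cat([visual, audio], dim=1)   # (B, Tv+Ta, D)
        out, _ = self.attn(x, x, x)
        return self.norm(x + out)


class CrossModalBlock(nn.Module):
    """Reinforce one modality (query) with another (key/value)."""

    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query, context):
        out, _ = self.attn(query, context, context)
        return self.norm(query + out)


class NHFNetSketch(nn.Module):
    """Non-homogeneous fusion: audio+visual are fused first, then
    interact with the already high-level text features."""

    def __init__(self, dim=128, heads=4, num_outputs=1):
        super().__init__()
        self.av_fusion = AudioVisualFusion(dim, heads)
        self.text_to_av = CrossModalBlock(dim, heads)   # text attends to A/V fusion
        self.av_to_text = CrossModalBlock(dim, heads)   # A/V fusion attends to text
        self.head = nn.Linear(2 * dim, num_outputs)

    def forward(self, text, visual, audio):
        av = self.av_fusion(visual, audio)
        t_enh = self.text_to_av(text, av)
        av_enh = self.av_to_text(av, t_enh)
        # Pool over time and predict a sentiment score.
        pooled = torch.cat([t_enh.mean(dim=1), av_enh.mean(dim=1)], dim=-1)
        return self.head(pooled)


# Usage with random tensors (batch of 2, hypothetical sequence lengths):
model = NHFNetSketch()
score = model(torch.randn(2, 50, 128), torch.randn(2, 60, 128), torch.randn(2, 60, 128))
print(score.shape)  # torch.Size([2, 1])
```

The key design point the abstract emphasizes is the ordering: visual and audio are aggregated before meeting text, so that all three modalities interact at a comparable level of semantic abstraction.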