Computer science
Discriminative
Pattern
Modality (human-computer interaction)
Artificial intelligence
Multimodal learning
Representation (politics)
Sentiment analysis
Multimodality
Natural language processing
Political science
Social science
Politics
World Wide Web
Sociology
Law
Authors
Guofeng Yi, Cunhang Fan, Kun Zhu, Zhao Lv, Shan Liang, Z. L. Wen, Guanxiong Pei, Taihao Li
Identifier
DOI: 10.1016/j.knosys.2023.111136
Abstract
Large-scale vision-and-language representation learning has improved performance on various joint vision-language downstream tasks. In this work, our objective is to extend it effectively to multimodal sentiment analysis and to address two urgent challenges in this field: (1) the low contribution of the visual modality, and (2) the design of an effective multimodal fusion architecture. To overcome the imbalance between the visual and textual modalities, we propose an inter-frame hybrid transformer, which extends the recent CLIP and TimeSformer architectures. This module extracts spatio-temporal features from sparsely sampled video frames, not only focusing on facial expressions but also capturing body movement information, providing a more comprehensive visual representation than the conventional direct use of pre-extracted facial features. Additionally, we tackle the challenge of modality heterogeneity in the fusion architecture by introducing a new scheme that prompts and aligns the video and text information before fusing them. Specifically, we generate discriminative text prompts based on the video content to enhance the text representation, and we align the unimodal video-text features using a video-text contrastive loss. Our proposed end-to-end trainable model achieves state-of-the-art performance using only two modalities on three widely used datasets: MOSI, MOSEI, and CH-SIMS. These experimental results validate the effectiveness of our approach in improving multimodal sentiment analysis.
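To make the video-text alignment step more concrete, the sketch below shows a generic symmetric contrastive (InfoNCE-style) loss over paired video and text embeddings, which is one standard way to realize the video-text contrastive loss mentioned in the abstract. It is a minimal sketch, not the paper's implementation: the function name, embedding dimension, batch size, and temperature value are illustrative assumptions.

```python
# Minimal sketch of a symmetric video-text contrastive (InfoNCE) loss.
# Names, sizes, and the temperature value are assumptions for illustration,
# not taken from the paper.
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """video_emb, text_emb: (batch, dim) unimodal features for paired samples."""
    # L2-normalize so the dot product becomes a cosine similarity.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    # (batch, batch) similarity matrix; diagonal entries are the positive pairs.
    logits = v @ t.T / temperature
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric cross-entropy: video-to-text and text-to-video directions.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return (loss_v2t + loss_t2v) / 2

# Usage with random features standing in for the video and text encoders' outputs.
if __name__ == "__main__":
    video_features = torch.randn(8, 512)
    text_features = torch.randn(8, 512)
    print(video_text_contrastive_loss(video_features, text_features))
```

In this formulation, matched video-text pairs in a batch serve as positives and all other pairings as negatives, which pulls the two unimodal feature spaces together before fusion.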