Computer science
Modality
Sentiment analysis
Image (mathematics)
Artificial intelligence
Natural language processing
Information retrieval
Computer vision
Human-computer interaction
Chemistry
Polymer chemistry
Authors
Xintao Lu, Yonglong Ni, Zuohua Ding
Identifier
DOI: 10.14569/ijacsa.2024.0150290
Abstract
Multimodal sentiment analysis extends traditional text-based sentiment analysis to multiple modalities. However, the field still faces challenges such as inconsistent cross-modal feature information, weak cross-modal interaction, and insufficient feature fusion. To address these issues, this paper proposes a cross-modal sentiment model based on CLIP image-text attention interaction. The model uses pre-trained ResNet50 and RoBERTa to extract primary image and text features. After contrastive learning with the CLIP model, it applies a multi-head attention mechanism for cross-modal feature interaction to enhance information exchange between modalities. A cross-modal gating module then fuses the feature networks, combining features at different levels while controlling feature weights, and the final output is fed into a fully connected layer for sentiment recognition. Comparative experiments are conducted on the publicly available MVSA-Single and MVSA-Multiple datasets. The results show that the model achieves accuracy of 75.38% and 73.95%, and F1-scores of 75.21% and 73.83%, on these datasets, respectively, indicating higher generalization and robustness than existing sentiment analysis models.
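The pipeline described in the abstract can be outlined as a rough PyTorch sketch. This is not the authors' implementation: the 512-dimensional shared projection space, the pooling of the image feature to a single vector, the exact gating formulation, and the three-class output head are all assumptions made for illustration, and the CLIP-style contrastive pre-alignment is only indicated by the projection layers rather than trained here.

import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights
from transformers import RobertaModel


class CrossModalSentimentSketch(nn.Module):
    """Illustrative sketch of the described pipeline (not the authors' code)."""

    def __init__(self, dim=512, heads=8, num_classes=3):
        super().__init__()
        # Primary feature extractors: pre-trained ResNet50 and RoBERTa.
        self.image_encoder = resnet50(weights=ResNet50_Weights.DEFAULT)
        self.image_encoder.fc = nn.Identity()                              # 2048-d image vector
        self.text_encoder = RobertaModel.from_pretrained("roberta-base")   # 768-d token features

        # Projections into a shared space; a CLIP-style contrastive loss
        # would align these projected image and text features.
        self.img_proj = nn.Linear(2048, dim)
        self.txt_proj = nn.Linear(768, dim)

        # Multi-head attention for cross-modal interaction (each modality queries the other).
        self.img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)

        # Cross-modal gate controlling the weight given to each interaction output.
        self.gate = nn.Sequential(nn.Linear(dim * 2, dim), nn.Sigmoid())
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, images, input_ids, attention_mask):
        img = self.img_proj(self.image_encoder(images)).unsqueeze(1)        # (B, 1, D)
        tokens = self.text_encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        txt = self.txt_proj(tokens)                                         # (B, L, D)

        img_att, _ = self.img2txt(img, txt, txt)        # image attends to text tokens
        txt_att, _ = self.txt2img(txt, img, img)        # text tokens attend to the image
        txt_att = txt_att.mean(dim=1, keepdim=True)     # pool the text side to one vector

        fused = torch.cat([img_att, txt_att], dim=-1).squeeze(1)            # (B, 2D)
        g = self.gate(fused)                                                # (B, D) gate weights
        out = g * img_att.squeeze(1) + (1 - g) * txt_att.squeeze(1)         # gated fusion
        return self.classifier(out)                                         # sentiment logits

A forward pass would take a batch of images sized for ResNet50 (e.g. 224x224) together with RoBERTa-tokenized text; in this sketch the gate interpolates between the two attention outputs, which is only one plausible reading of the paper's gated fusion module.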