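Modality
Computer science
Leverage (statistics)
Artificial intelligence
Valence (chemistry)
Audio-visual
Speech recognition
Modal verb
Machine learning
Pattern recognition (psychology)
Multimedia
Social science
Physics
Chemistry
Quantum mechanics
Sociology
Polymer chemistry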
Authors
R. Gnana Praveen,Patrick Cardinal,Éric Granger
Source
Journal: IEEE Transactions on Biometrics, Behavior, and Identity Science
[Institute of Electrical and Electronics Engineers]
Date: 2023-01-04
Volume (issue): 5 (3): 360-373
Citations: 22
Identifier
DOI:10.1109/tbiom.2022.3233083
Abstract
Automatic emotion recognition (ER) has recently gained much interest due to its potential in many real-world applications. In this context, multimodal approaches have been shown to improve performance (over unimodal approaches) by combining diverse and complementary sources of information, providing some robustness to noisy and missing modalities. In this paper, we focus on dimensional ER based on the fusion of facial and vocal modalities extracted from videos, where complementary audio-visual (A-V) relationships are explored to predict an individual's emotional states in valence-arousal space. Most state-of-the-art fusion techniques rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of A-V modalities. To address this problem, we introduce a joint cross-attentional model for A-V fusion that extracts the salient features across A-V modalities and allows it to effectively leverage the inter-modal relationships while retaining the intra-modal relationships. In particular, it computes the cross-attention weights based on the correlation between the joint feature representation and that of the individual modalities. Deploying the joint A-V feature representation into the cross-attention module helps to simultaneously leverage both the intra- and inter-modal relationships, thereby significantly improving the performance of the system over the vanilla cross-attention module. The effectiveness of our proposed approach is validated experimentally on challenging videos from the RECOLA and AffWild2 datasets. Results indicate that our joint cross-attentional A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches, even when the modalities are noisy or absent. Code is available at https://github.com/praveena2j/Joint-Cross-Attention-for-Audio-Visual-Fusion.
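
The abstract states that the cross-attention weights come from the correlation between the joint A-V representation and each individual modality. The following is a minimal NumPy sketch of that idea only, not the authors' implementation (which is in the linked repository). Everything not stated in the abstract is an assumption made for illustration: the function name `joint_cross_attention`, the learnable matrices `w_ja` and `w_jv`, the tanh scaling, the softmax normalization, the residual connection, and the feature dimensions.

```python
import numpy as np


def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def joint_cross_attention(x_a, x_v, w_ja, w_jv):
    """Hypothetical joint cross-attention over audio and visual features.

    x_a: (d_a, L) audio features, x_v: (d_v, L) visual features,
    where L is the number of time steps.
    w_ja: (d_a, d_a + d_v) and w_jv: (d_v, d_a + d_v) are assumed
    learnable projections (random in the demo below).
    """
    d = x_a.shape[0] + x_v.shape[0]

    # Joint representation: stack both modalities along the feature axis.
    joint = np.concatenate([x_a, x_v], axis=0)          # (d, L)

    # Correlation of each modality with the joint representation.
    # Entry [i, k] relates modality time step i to joint time step k.
    c_a = np.tanh(x_a.T @ w_ja @ joint / np.sqrt(d))    # (L, L)
    c_v = np.tanh(x_v.T @ w_jv @ joint / np.sqrt(d))    # (L, L)

    # Assumption: turn correlations into weights with a column-wise softmax.
    attn_a = softmax(c_a, axis=0)
    attn_v = softmax(c_v, axis=0)

    # Attended features plus a residual connection (also an assumption).
    out_a = x_a + x_a @ attn_a                          # (d_a, L)
    out_v = x_v + x_v @ attn_v                          # (d_v, L)
    return out_a, out_v, attn_a, attn_v


# Demo with arbitrary toy dimensions and random inputs.
rng = np.random.default_rng(0)
d_a, d_v, L = 4, 6, 5
x_a = rng.standard_normal((d_a, L))
x_v = rng.standard_normal((d_v, L))
w_ja = rng.standard_normal((d_a, d_a + d_v))
w_jv = rng.standard_normal((d_v, d_a + d_v))

out_a, out_v, attn_a, attn_v = joint_cross_attention(x_a, x_v, w_ja, w_jv)
print(out_a.shape, out_v.shape)
```

Because the joint representation contains the modality's own features as well as the other modality's, each correlation matrix mixes intra-modal and inter-modal terms, which is the property the abstract contrasts with vanilla cross-attention.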