Authors
Chujie Xu, Yong Du, Jingzi Wang, Wenjie Zheng, Tiejun Li, Zhan-Sheng Yuan
Abstract
Emotion recognition in conversations (ERC) is increasingly being applied in various IoT devices. Deep-learning-based multimodal ERC has achieved great success by leveraging diverse and complementary modalities. Although most existing methods adopt attention mechanisms to fuse information from different modalities, they ignore the complementarity between modalities. The joint cross-attention model was introduced to alleviate this issue, but it does not exploit multi-scale feature information across the modalities. Moreover, context relationships play an important role in feature extraction for the expression recognition task. In this paper, we propose a novel joint hierarchical graph convolution network (JHGCN) that exploits features from different layers and context relationships for facial expression recognition based on audio-visual (A-V) information. Specifically, we adopt separate deep networks to extract features from each modality. For the V modality, we construct graph data from patch embeddings extracted by the transformer encoder, and we embed graph convolution, which leverages intra-modality relationships, within the transformer encoder. The deep features from different layers are then fed to a hierarchical fusion module to enhance the feature representation. Finally, a joint cross-attention mechanism exploits the complementary inter-modality relationships. To validate the proposed model, we conducted experiments on the AffWild2 and CMU-MOSI datasets. All results confirm that the proposed model achieves highly promising performance compared to the joint cross-attention model and other methods.
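The V-modality step described above, building graph data over transformer patch embeddings and applying graph convolution to capture intra-modality relationships, can be illustrated with a short sketch. This is a minimal illustration in plain PyTorch under stated assumptions, not the paper's implementation: the cosine k-NN graph construction, the knn_adjacency helper, and all dimensions are hypothetical stand-ins for whatever the authors actually use.

```python
# Minimal sketch: graph convolution over transformer patch embeddings.
# Assumptions: plain PyTorch; a cosine k-NN graph stands in for the
# paper's (unspecified here) graph construction; dims are illustrative.
import torch
import torch.nn as nn

def knn_adjacency(patches: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Build a row-normalized adjacency over patches via cosine k-NN."""
    x = nn.functional.normalize(patches, dim=-1)   # (n, d), unit-norm rows
    sim = x @ x.t()                                # pairwise cosine similarity
    idx = sim.topk(k + 1, dim=-1).indices          # top-k neighbors (incl. self)
    adj = torch.zeros_like(sim).scatter_(1, idx, 1.0)
    adj = (adj + adj.t()).clamp(max=1.0)           # symmetrize the graph
    deg = adj.sum(-1, keepdim=True)                # node degrees (>= 1, self-loop)
    return adj / deg                               # row-normalize

class GraphConv(nn.Module):
    """One GCN layer: aggregate neighboring patches, then project."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, patches: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.proj(adj @ patches))

# Usage: patch embeddings from a transformer encoder (random stand-ins here).
patches = torch.randn(196, 768)                    # e.g. 14x14 ViT-style patches
out = GraphConv(768)(patches, knn_adjacency(patches))
```

In the paper this operation is embedded alongside the transformer encoder, so the graph aggregation and self-attention jointly shape the intra-modality representation.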
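The hierarchical fusion of features taken from different layers can likewise be sketched. The concatenate-and-project merging below is an assumption for illustration; the paper's actual module may combine layers differently.

```python
# Minimal sketch: hierarchical fusion of multi-layer features.
# Assumption: pairwise concatenate-and-project merging, shallow -> deep.
import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    """Progressively merge features from shallow to deep layers."""
    def __init__(self, dim: int, num_layers: int):
        super().__init__()
        self.mergers = nn.ModuleList(
            nn.Linear(2 * dim, dim) for _ in range(num_layers - 1)
        )

    def forward(self, layer_feats):
        # layer_feats: list of (batch, dim) tensors, ordered shallow -> deep
        fused = layer_feats[0]
        for merger, feat in zip(self.mergers, layer_feats[1:]):
            fused = torch.relu(merger(torch.cat([fused, feat], dim=-1)))
        return fused

# Usage with three hypothetical layer outputs:
feats = [torch.randn(4, 256) for _ in range(3)]
print(HierarchicalFusion(256, 3)(feats).shape)     # torch.Size([4, 256])
```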
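Finally, the joint cross-attention fusion of the A and V streams can be sketched as follows. This is a simplified reading of the joint cross-attention idea the abstract builds on, assuming PyTorch; the attention parameterization, module name, and feature dimensions are all illustrative, not the authors' exact formulation.

```python
# Minimal sketch: joint cross-attention over audio-visual features.
# Assumption: attention weights for each modality are computed from the
# concatenated joint A-V representation; names and dims are hypothetical.
import torch
import torch.nn as nn

class JointCrossAttention(nn.Module):
    """Re-weight each modality using attention derived from the joint
    A-V feature, so each stream is conditioned on the complementary one."""
    def __init__(self, dim_a: int, dim_v: int):
        super().__init__()
        joint = dim_a + dim_v
        # Learnable maps from the joint feature to per-modality scores.
        self.w_a = nn.Linear(joint, dim_a, bias=False)
        self.w_v = nn.Linear(joint, dim_v, bias=False)

    def forward(self, x_a: torch.Tensor, x_v: torch.Tensor):
        # x_a: (batch, seq, dim_a), x_v: (batch, seq, dim_v)
        j = torch.cat([x_a, x_v], dim=-1)              # joint A-V representation
        attn_a = torch.softmax(torch.tanh(self.w_a(j)), dim=-1)
        attn_v = torch.softmax(torch.tanh(self.w_v(j)), dim=-1)
        # Attended features with a residual connection to each input stream.
        return x_a * attn_a + x_a, x_v * attn_v + x_v

# Usage with hypothetical audio (128-d) and visual (512-d) sequences:
x_a, x_v = torch.randn(2, 50, 128), torch.randn(2, 50, 512)
a_att, v_att = JointCrossAttention(128, 512)(x_a, x_v)
```

Deriving both attention maps from the shared joint representation, rather than attending each modality to the other in isolation, is what lets this style of fusion exploit the complementary inter-modality relationships the abstract emphasizes.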