Computer science
Pattern
Speech recognition
Deep learning
Artificial intelligence
Emotion recognition
Audiovisual
Multimodal learning
Modality (human–computer interaction)
Human–computer interaction
Machine learning
Multimedia
Social science
Sociology
Authors
Asif Iqbal Middya, B. Nag, Sarbani Roy
Identifier
DOI: 10.1016/j.knosys.2022.108580
Abstract
Emotion identification based on multimodal data (e.g., audio, video, text, etc.) is one of the most demanding and important research fields, with a wide range of applications. In this context, this research work conducts a rigorous exploration of model-level fusion to identify the optimal multimodal model for emotion recognition using audio and video modalities. More specifically, separate novel feature extractor networks for audio and video data are proposed. An optimal multimodal emotion recognition model is then created by fusing the audio and video features at the model level. The performance of the proposed models is assessed on two benchmark multimodal datasets, namely the Ryerson Audio–Visual Database of Emotional Speech and Song (RAVDESS) and Surrey Audio–Visual Expressed Emotion (SAVEE), using various performance metrics. The proposed models achieve high predictive accuracies of 99% and 86% on the SAVEE and RAVDESS datasets, respectively. The effectiveness of the models is also verified by comparing their performance with existing emotion recognition models. Several case studies are also conducted to explore the models' ability to capture the variability of the speakers' emotional states in publicly available real-world audio–visual media.
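The abstract's core idea, model-level fusion, means each modality is passed through its own feature extractor and the resulting embeddings are combined inside the model (typically by concatenation) before a shared classification head. The sketch below illustrates that pattern with NumPy; the input/embedding dimensions, the random-projection "extractors", and the softmax head are placeholder assumptions for illustration, not the architecture from the paper (RAVDESS does, however, use 8 emotion classes).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, not taken from the paper.
AUDIO_IN, VIDEO_IN = 40, 512      # e.g. audio features, per-frame video embedding
AUDIO_DIM, VIDEO_DIM = 128, 256   # per-modality embedding sizes
N_CLASSES = 8                     # RAVDESS labels 8 emotion classes

# Stand-ins for the per-modality feature extractor networks:
# fixed random linear projections followed by a nonlinearity.
W_a = rng.standard_normal((AUDIO_IN, AUDIO_DIM)) * 0.1
W_v = rng.standard_normal((VIDEO_IN, VIDEO_DIM)) * 0.1
# Classification head operating on the fused representation.
W_c = rng.standard_normal((AUDIO_DIM + VIDEO_DIM, N_CLASSES)) * 0.1

def model_level_fusion(audio_x, video_x):
    """Embed each modality separately, fuse by concatenation, classify."""
    h_a = np.tanh(audio_x @ W_a)                 # audio branch embedding
    h_v = np.tanh(video_x @ W_v)                 # video branch embedding
    fused = np.concatenate([h_a, h_v], axis=-1)  # model-level fusion
    logits = fused @ W_c
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)     # softmax class probabilities

# A batch of 4 synthetic audio/video feature vectors.
probs = model_level_fusion(rng.standard_normal((4, AUDIO_IN)),
                           rng.standard_normal((4, VIDEO_IN)))
print(probs.shape)  # (4, 8): one probability distribution per sample
```

The design point is that fusion happens on learned intermediate representations rather than on raw inputs (early fusion) or on per-modality predictions (late/decision fusion), letting the head exploit cross-modal interactions.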