Computer science
Emotion recognition
Artificial intelligence
Pattern recognition (psychology)
Feature (linguistics)
Feature extraction
Fusion
Speech recognition
Computer vision
Linguistics
Philosophy
Authors
Omkumar Chandraumakantham, N Gowtham, Mohammed Zakariah, Abdulaziz S. Almazyad
Source
Journal: IEEE Access
[Institute of Electrical and Electronics Engineers]
Date: 2024-01-01
Volume: 12, pages 108052-108071
Citations: 18
Identifier
DOI: 10.1109/access.2024.3425953
Abstract
Multimodal emotion recognition is a developing field that analyzes emotions through multiple channels, mainly audio, video, and text. However, existing state-of-the-art systems consider at most two or three modalities, rely on traditional techniques, fail to model the interplay between emotional cues, offer little scope for adding further modalities, and fall short of accurate emotion prediction. This research proposes a novel approach that uses rule-based systems to convert non-verbal cues into text, inspired by a limited prior attempt that lacked proper benchmarking. It achieves efficient multimodal emotion recognition with distilRoBERTa, a large language model fine-tuned on a combined textual representation of audio features (such as loudness, spectral flux, MFCCs, pitch stability, and emphasis) and visual features (action units) extracted from videos. The approach is evaluated on the RAVDESS and BAUM-1 datasets, achieving high accuracy on both (93.18% on RAVDESS and 93.69% on BAUM-1) and performing on par with, if not slightly better than, state-of-the-art (SOTA) systems. Furthermore, the research highlights the potential for incorporating additional modalities by transforming them into text with rule-based systems and using the result to further fine-tune pre-trained large language models, yielding a more comprehensive approach to emotion recognition.
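To make the rule-based textualization concrete, the sketch below shows one way numeric audio measurements (loudness, spectral flux, pitch stability) could be mapped to short textual cues by simple thresholds and then scored with a distilRoBERTa text classifier. The feature proxies, thresholds, input file name, and model checkpoint are illustrative assumptions, not the paper's actual rules or weights; the same idea would extend to facial action units extracted from video.

```python
import numpy as np
import librosa
from transformers import pipeline

def audio_cues_to_text(wav_path: str) -> str:
    """Map prosodic/spectral measurements to a short rule-based description."""
    y, sr = librosa.load(wav_path, sr=16_000)

    # Loudness proxy: mean RMS energy over the clip.
    rms = float(librosa.feature.rms(y=y).mean())

    # Spectral-flux proxy: mean onset strength (frame-to-frame spectral change).
    flux = float(librosa.onset.onset_strength(y=y, sr=sr).mean())

    # Pitch stability: spread of the YIN f0 track.
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)
    pitch_std = float(np.std(f0))

    # Illustrative thresholds -- NOT the paper's rules.
    parts = [
        "the voice is loud" if rms > 0.05 else "the voice is soft",
        "the spectrum changes rapidly" if flux > 1.0 else "the delivery is even",
        "the pitch is unstable" if pitch_std > 40 else "the pitch is steady",
    ]
    return "In this clip, " + ", ".join(parts) + "."

# A public distilRoBERTa emotion checkpoint, standing in for the paper's
# fine-tuned model (an assumption, not the authors' weights).
clf = pipeline("text-classification",
               model="j-hartmann/emotion-english-distilroberta-base")

cue_text = audio_cues_to_text("utterance.wav")  # hypothetical input file
print(cue_text)
print(clf(cue_text))
```

One appeal of this design, as the abstract notes, is that the fusion step becomes modality-agnostic: adding a new modality only requires writing its own cue-to-text rules, after which the same language model consumes the combined description.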