Spectrogram
Computer science
Mel cepstrum
Speech recognition
Utterance
Artificial intelligence
Leverage (statistics)
Pattern recognition (psychology)
Sequence (biology)
Feature extraction
Genetics
Biology
Authors
Jianyou Wang, Michael Xue, Ryan Culhane, Enmao Diao, Jie Ding, Vahid Tarokh
Identifier
DOI:10.1109/icassp40776.2020.9054629
Abstract
Speech Emotion Recognition (SER) has emerged as a critical component of the next generation of human-machine interfacing technologies. In this work, we propose a new dual-level model that predicts emotions based on both MFCC features and mel-spectrograms produced from raw audio signals. Each utterance is preprocessed into MFCC features and two mel-spectrograms at different time-frequency resolutions. A standard LSTM processes the MFCC features, while a novel LSTM architecture, denoted as Dual-Sequence LSTM (DS-LSTM), processes the two mel-spectrograms simultaneously. The outputs are later averaged to produce a final classification of the utterance. Our proposed model achieves, on average, a weighted accuracy of 72.7% and an unweighted accuracy of 73.3% (a 6% improvement over current state-of-the-art unimodal models) and is comparable with multimodal models that leverage textual information as well as audio signals.
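The abstract describes a late-fusion design: one branch consumes MFCC features, another consumes two mel-spectrograms at different time-frequency resolutions, and the branch outputs are averaged into a final class prediction. The sketch below illustrates only that fusion step, using simple linear projections as stand-ins for the LSTM and DS-LSTM branches; all dimensions (class count, feature sizes) and the linear stand-ins are assumptions for illustration, not the authors' architecture.

```python
import numpy as np

# Hypothetical dimensions -- chosen for illustration, not taken from the paper.
N_CLASSES = 4   # assumed number of emotion classes
MFCC_DIM = 40   # assumed MFCC feature dimension
SPEC_DIM = 128  # assumed mel-spectrogram frequency bins

rng = np.random.default_rng(0)

def softmax(z):
    """Numerically stable softmax over a logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Linear projections stand in for the LSTM (MFCC branch) and the
# DS-LSTM (dual-spectrogram branch); each maps features to class logits.
W_mfcc = rng.normal(size=(N_CLASSES, MFCC_DIM))
W_spec = rng.normal(size=(N_CLASSES, 2 * SPEC_DIM))  # two spectrograms, concatenated

def classify(mfcc_vec, spec_hi, spec_lo):
    """Average the two branch outputs into one utterance-level prediction."""
    logits_a = W_mfcc @ mfcc_vec                            # MFCC branch
    logits_b = W_spec @ np.concatenate([spec_hi, spec_lo])  # dual-spectrogram branch
    probs = (softmax(logits_a) + softmax(logits_b)) / 2     # late fusion by averaging
    return int(np.argmax(probs)), probs

# One synthetic utterance, summarized here as fixed-size feature vectors.
pred, probs = classify(rng.normal(size=MFCC_DIM),
                       rng.normal(size=SPEC_DIM),
                       rng.normal(size=SPEC_DIM))
print(pred, probs.sum())
```

In the paper the two branches are recurrent networks over feature sequences; the averaging of per-branch class probabilities shown here is the fusion rule stated in the abstract.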