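Computer science
Spectrogram
Convolutional neural network
Task (project management)
Generalization
Artificial intelligence
Artificial neural network
Speech recognition
Deep learning
Feature (linguistics)
Filter (signal processing)
Channel (broadcasting)
Pattern recognition (psychology)
Computer vision
Engineering
Mathematics
Linguistics
Mathematical analysis
Philosophy
Systems engineering
Computer network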
Authors
Zhentao Liu, Mengting Han, Bao-Han Wu, Abdul Rehman
Identifier
DOI:10.1016/j.apacoust.2022.109178
Abstract
Speech emotion recognition (SER) is a challenging task because the distribution of features differs from speaker to speaker. To improve the generalization performance and accuracy of SER, we apply balanced augmented sampling to triple-channel log-Mel spectrograms, which reduces the imbalance of the sample distribution among emotional categories and provides sufficient inputs for the deep neural network model. A time-domain filter and a frequency-domain filter are each used to process the triple-channel log-Mel spectrograms in order to increase the diversity of features. A deep neural network composed of a convolutional neural network (CNN) and an attention-based bidirectional long short-term memory network (ABLSTM) is then employed for feature extraction, with multi-task learning adopted to improve its performance. We select seven auxiliary tasks and determine the optimal ones through experimental comparison. Finally, we evaluate our method on the IEMOCAP and MSP-IMPROV databases. It achieves a WAR of 70.27% and a UAR of 66.27% on IEMOCAP, and a WAR of 60.90% and a UAR of 61.83% on MSP-IMPROV, which indicates better performance than other works.
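The two metrics reported above can be illustrated with a minimal sketch. It assumes the usual SER definitions: WAR (weighted average recall) is the overall fraction of correctly classified utterances, and UAR (unweighted average recall) is the mean of the per-class recalls, so every emotion counts equally regardless of how many samples it has. The function name `war_uar` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def war_uar(y_true, y_pred):
    """Return (WAR, UAR) for two equal-length sequences of class labels.

    Assumed definitions (not confirmed by the abstract):
      WAR = correctly classified samples / all samples
      UAR = mean over classes of (correct in class / samples in class)
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    total = defaultdict(int)    # samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1

    war = sum(correct.values()) / len(y_true)
    uar = sum(correct[c] / total[c] for c in total) / len(total)
    return war, uar


# Toy, made-up labels (not data from the paper): class 0 dominates.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 2, 1]
war, uar = war_uar(y_true, y_pred)
print(f"WAR = {war:.4f}, UAR = {uar:.4f}")
```

In this toy case the majority class is always right while each minority class is right half the time, so WAR is noticeably higher than UAR. On class-imbalanced emotion corpora the two numbers can diverge in the same way, which is one reason both are commonly reported.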