Emotion recognition
Feature (linguistics)
Speech recognition
Computer science
Fusion
Ensemble learning
Deep learning
Artificial intelligence
Pattern recognition (psychology)
Linguistics
Philosophy
Authors
Mengsheng Wang, Hongbin Ma, Yingli Wang, Xian-He Sun
Identifier
DOI:10.1016/j.apacoust.2024.109886
Abstract
In the realm of consumer technology, Artificial Intelligence (AI)-based Speech Emotion Recognition (SER) has rapidly gained traction and integration into smart home systems. Its precision in recognition has become a pivotal factor significantly impacting user experience. However, the intricate task of selecting suitable features has emerged as a daunting challenge due to the variances in speech features induced by emotional nuances. Present research predominantly concentrates on localized speech characteristics, neglecting the broader contextual cues inherent in speech signals. This oversight contributes to relatively diminished accuracy in emotion recognition within smart home systems. To tackle this challenge, this paper introduces an enhanced Speech Emotion Recognition approach named TF-Mix. This methodology enriches emotional prediction from speech by leveraging audio data augmentation and embracing multiple features, thereby achieving superior performance in emotion recognition. To augment the model's adaptability, TF-Mix adeptly amalgamates various feature extraction techniques, encompassing Convolutional Neural Networks (CNNs), Long Short-Term Memory networks (LSTMs), and Transformer architecture. The synergy among these methodologies culminates in the formulation of three distinct architectural models. The primary architecture is founded on a 1-dimensional Convolutional Neural Network (CNN), closely followed by a Fully Connected Network (FCN). Subsequent architectures, notably BiLSTM-FCN and BiLSTM-Transformer-FCN, retain their respective structures while incorporating CNNs. Moreover, the amalgamation of individual models into an ensemble model, designated as D, via weighted averaging, further amplifies the efficacy of emotion recognition. Experimental outcomes showcase exceptional performance across all four models in the SER task. 
The ensemble Model D achieves noteworthy accuracy across multiple datasets: 87.513% on RAVDESS, 86.233% on SAVEE, 99.857% on TESS, 82.295% on CREMA-D, and 97.546% on the TOTAL dataset.
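The abstract states that Model D fuses the individual models by weighted averaging of their predictions, but does not give the weights or fusion details. A minimal sketch of such weighted-average ensembling over per-model class-probability outputs (the function name, weights, and toy probabilities below are illustrative assumptions, not the paper's values) might look like:

```python
import numpy as np

def weighted_average_ensemble(probs_list, weights):
    """Fuse per-model class-probability matrices (n_samples x n_classes)
    by a weighted average; weights are normalized to sum to 1."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    stacked = np.stack(probs_list, axis=0)    # (n_models, n_samples, n_classes)
    fused = np.tensordot(w, stacked, axes=1)  # weighted sum over the model axis
    return fused, fused.argmax(axis=1)        # fused probabilities, class labels

# Toy example: three hypothetical base models voting over 4 emotion classes.
p_a = np.array([[0.6, 0.2, 0.1, 0.1]])
p_b = np.array([[0.3, 0.4, 0.2, 0.1]])
p_c = np.array([[0.5, 0.3, 0.1, 0.1]])
fused, labels = weighted_average_ensemble([p_a, p_b, p_c], weights=[0.4, 0.3, 0.3])
# fused[0] -> [0.48, 0.29, 0.13, 0.10], labels[0] -> 0
```

Weighted averaging of probabilities (rather than hard-label voting) lets a confident base model pull the fused decision even when the others disagree, which is one common reason such ensembles outperform their individual members.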