Computer science
Frame (networking)
Perspective (graphics)
Convolutional neural network
Feature (linguistics)
Speech recognition
Task (project management)
Artificial intelligence
Domain (mathematical analysis)
Pattern recognition (psychology)
Engineering
Telecommunications
Linguistics
Mathematical analysis
Philosophy
Systems engineering
Mathematics
Authors
Guoyan Li,Jin Hou,Yi Liu,Jianguo Wei
Identifier
DOI:10.1016/j.apacoust.2023.109658
Abstract
Speech emotion recognition (SER) is a crucial and challenging task in affective computing due to the intricacy and variability inherent in speech. In this paper, a novel method, MPAF-CNN, is proposed that combines a convolutional neural network (CNN)-based multiperspective aware module (MPAM) with a frame-level fine-grained fusion strategy (FFS) for SER. MPAM perceives the emotional information embedded in speech from three perspectives: local, frame-level, and global. From the local perspective, the module applies a multiscale design to perceive multi-granular emotional information under different local receptive fields. From the frame-level perspective, a novel frame-level aggregated attention is proposed to learn the intrinsic emotional associations of intermediate features, strengthen the model's attention to emotionally informative frames, and improve the emotional expressiveness of intermediate features. From the global perspective, multiple layers of global intermediate features are aggregated across the time, frequency, and channel dimensions to enhance the model's ability to extract and express global feature information. In addition, a new frame-level fine-grained fusion strategy employs an attention mechanism to model the interaction of emotional representations from different acoustic features at the frame level, capturing their underlying relationships and further improving the overall performance of the model. Experimental results show that the method performs well in recognizing speech emotions: MPAF-CNN obtains recognition accuracies of 72.19% and 72.88% on the IEMOCAP database.
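The frame-level attention idea in the abstract (weighting emotionally informative frames before aggregation) can be illustrated with a generic attention-pooling sketch. This is a minimal NumPy illustration of attention pooling over frames, not the authors' exact MPAM or FFS formulation; the scoring vector `w` and the function names are hypothetical stand-ins.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def frame_attention_pool(frames, w):
    """Generic attention pooling over frame-level features.

    frames: (T, D) array of T frame-level feature vectors.
    w:      (D,) hypothetical learned scoring vector (stand-in for the
            paper's frame-level aggregated attention parameters).
    Returns the (D,) utterance-level vector and the (T,) attention weights.
    """
    scores = frames @ w       # one relevance score per frame
    alpha = softmax(scores)   # normalized attention over frames
    pooled = alpha @ frames   # attention-weighted sum of frames
    return pooled, alpha

# Toy usage: the frame with the highest score receives the most weight.
frames = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
w = np.array([1.0, 1.0])
pooled, alpha = frame_attention_pool(frames, w)
```

In the paper's setting, such weights would let the utterance-level representation emphasize frames that carry stronger emotional cues instead of averaging all frames uniformly.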