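Computer science
Transformer
Window (computing)
Speech recognition
Architecture
Artificial intelligence
Engineering
Art
Voltage
Electrical engineering
Visual arts
Operating system
Authors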
Shuaiqi Chen,Xiaofen Xing,Wei-Bin Zhang,Weidong Chen,Xiangmin Xu
Identifier
DOI:10.1109/icassp49357.2023.10094651
Abstract
Speech emotion recognition is crucial to human-computer interaction. The temporal regions that carry different emotions are scattered locally across different parts of the speech. Moreover, the temporal scales of important information may vary over a large range within and across speech segments. Although transformer-based models have made progress in this field, existing models cannot precisely locate important regions at different temporal scales. To address this issue, we propose Dynamic Window transFormer (DWFormer), a new architecture that leverages temporal importance by dynamically splitting samples into windows. A self-attention mechanism is applied within each window to capture temporally important information locally at a fine granularity. Cross-window information interaction is also taken into account for global communication. DWFormer is evaluated on both the IEMOCAP and MELD datasets. Experimental results show that the proposed model achieves better performance than previous state-of-the-art methods.
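The idea of splitting a sequence into windows by temporal importance and attending within each window can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `split_windows` and `windowed_self_attention`, the median threshold used to place window boundaries, and the projection-free single-head attention are all assumptions made for illustration, and cross-window interaction is left out.

```python
import numpy as np


def split_windows(importance, threshold=None):
    """Split frame indices into contiguous windows.

    Assumption for illustration: a new window starts wherever the
    per-frame importance crosses the threshold (median by default).
    Returns a list of (start, end) pairs with end exclusive.
    """
    importance = np.asarray(importance, dtype=float)
    if threshold is None:
        threshold = np.median(importance)
    mask = importance > threshold
    # Positions where the mask flips mark the start of a new window.
    cuts = np.flatnonzero(mask[1:] != mask[:-1]) + 1
    edges = np.concatenate(([0], cuts, [len(mask)]))
    return [(int(s), int(e)) for s, e in zip(edges[:-1], edges[1:])]


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def windowed_self_attention(x, windows):
    """Scaled dot-product self-attention restricted to each window.

    Hypothetical simplification: queries, keys, and values are the raw
    frame features (no learned projections, single head).
    """
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    d = x.shape[-1]
    for s, e in windows:
        seg = x[s:e]
        weights = softmax(seg @ seg.T / np.sqrt(d))
        out[s:e] = weights @ seg
    return out


# Toy example: 6 frames with 4-dimensional features.
importance = [0.1, 0.2, 0.9, 0.8, 0.1, 0.7]
frames = np.arange(24, dtype=float).reshape(6, 4) / 10.0
windows = split_windows(importance)
attended = windowed_self_attention(frames, windows)
```

Because window boundaries follow the importance scores, window lengths vary from sample to sample, which is the property the abstract refers to as dynamic. Frames in a window of length one attend only to themselves and pass through unchanged in this simplified form.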