Attention guided learnable time-domain filterbanks for speech depression detection

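Computer science; Artificial intelligence; Feature (linguistics); Speech recognition; Domain (mathematical analysis); Supervised learning; Scale (ratio); Perspective (graphical); Pattern recognition (psychology); Machine learning; Artificial neural network; Mathematics; Philosophy; Mathematical analysis; Physics; Quantum mechanics; Linguistics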

Authors

Wenju Yang,Jiankang Liu,Peng Cao,Rongxin Zhu,Yang Wang,Jian K. Liu,Fei Wang,Xizhe Zhang

Source

Journal: Neural Networks [Elsevier BV]
Date: 2023-05-27 Volume/Issue: 165: 135-149 Citations: 13

Links

nih.gov, doi.org

Identifier

DOI: 10.1016/j.neunet.2023.05.041

Abstract

Depression, a global mental health problem, lacks effective screening methods that can help with early detection and treatment. This paper aims to facilitate the large-scale screening of depression by focusing on the speech depression detection (SDD) task. Currently, direct modeling on the raw signal yields a large number of parameters, and the existing deep learning-based SDD models mainly use fixed Mel-scale spectral features as input. However, these features are not designed for depression detection, and the manual settings limit the exploration of fine-grained feature representations. In this paper, we learn effective representations of the raw signals from an interpretable perspective. Specifically, we present a joint learning framework with attention-guided learnable time-domain filterbanks for depression classification (DALF), which combines a depression filterbanks features learning (DFBL) module with a multi-scale spectral attention learning (MSSA) module. DFBL is capable of producing biologically meaningful acoustic features by employing learnable time-domain filters, and MSSA is used to guide the learnable filters to better retain the useful frequency sub-bands. We collect a new dataset, the Neutral Reading-based Audio Corpus (NRAC), to facilitate research in depression analysis, and we evaluate the performance of DALF on the NRAC and the public DAIC-woz datasets. The experimental results demonstrate that our method outperforms the state-of-the-art SDD methods with an F1 of 78.4% on the DAIC-woz dataset. In particular, DALF achieves F1 scores of 87.3% and 81.7% on two parts of the NRAC dataset. By analyzing the filter coefficients, we find that the most important frequency range identified by our method is 600-700 Hz, which corresponds to the Mandarin vowels /e/ and /ê/ and can be considered an effective biomarker for the SDD task. Taken together, our DALF model provides a promising approach to depression detection.
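
The abstract does not state how the learnable time-domain filters are parameterized, so the following is only an illustrative sketch of the general idea, not the DFBL module itself. It assumes complex Gabor filters whose center frequencies and bandwidths are the quantities that training would adjust, initialized on the Mel scale, followed by squared-modulus, average pooling, and log compression. The function names, the Gabor choice, and all default values (16 kHz sampling, 401-tap kernels, 160-sample hop) are assumptions made for illustration; a real model would hold `center_hz` and `bandwidth_hz` as trainable parameters in an autodiff framework.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_spaced_centers(n_filters, f_min, f_max):
    """Center frequencies (Hz) equally spaced on the Mel scale (assumed init)."""
    mels = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_filters)
    return mel_to_hz(mels)


def gabor_filters(center_hz, bandwidth_hz, sample_rate, kernel_size):
    """Complex Gabor impulse responses, one row per filter.

    bandwidth_hz is treated as the standard deviation of each filter's
    Gaussian frequency response (an assumption of this sketch).
    """
    t = (np.arange(kernel_size) - kernel_size // 2) / sample_rate  # seconds
    sigma_t = 1.0 / (2.0 * np.pi * np.asarray(bandwidth_hz, dtype=float))
    envelope = np.exp(-0.5 * (t[None, :] / sigma_t[:, None]) ** 2)
    carrier = np.exp(2j * np.pi * np.asarray(center_hz)[:, None] * t[None, :])
    filters = envelope * carrier
    # Unit L2 norm so filters with equal bandwidth have equal peak gain.
    return filters / np.linalg.norm(filters, axis=1, keepdims=True)


def filterbank_features(signal, center_hz, bandwidth_hz,
                        sample_rate=16000, kernel_size=401, hop=160):
    """Raw waveform -> (n_filters, n_frames) log-energy features."""
    filters = gabor_filters(center_hz, bandwidth_hz, sample_rate, kernel_size)
    # Time-domain filtering: one convolution per filter.
    responses = np.stack([np.convolve(signal, h, mode="same") for h in filters])
    energy = np.abs(responses) ** 2
    # Non-overlapping average pooling over frames of `hop` samples.
    n_frames = energy.shape[1] // hop
    pooled = energy[:, : n_frames * hop].reshape(len(filters), n_frames, hop)
    return np.log(pooled.mean(axis=2) + 1e-6)


if __name__ == "__main__":
    sr = 16000
    centers = mel_spaced_centers(40, 100.0, 4000.0)
    bandwidths = np.full(40, 40.0)  # Hz, constant here for simplicity
    t = np.arange(sr) / sr
    tone = np.sin(2 * np.pi * 650.0 * t)  # inside the 600-700 Hz band
    feats = filterbank_features(tone, centers, bandwidths, sample_rate=sr)
    best = int(np.argmax(feats.mean(axis=1)))
    print("strongest channel center (Hz):", round(float(centers[best]), 1))
```

Inspecting which center frequencies carry the most weight after training is the kind of filter-coefficient analysis the abstract refers to when it reports the 600-700 Hz range; in this untrained sketch the centers simply stay at their Mel-scale initialization.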
