计算机科学
光谱图
Mel倒谱
语音识别
卷积神经网络
人工智能
模式识别(心理学)
支持向量机
深度学习
分类器(UML)
特征(语言学)
特征提取
频域
计算机视觉
语言学
哲学
作者
F. Javanmardi,Sudarsana Reddy Kadiri,Paavo Alku
标识
DOI:10.1016/j.csl.2023.101552
摘要
To distinguish pathological voices from healthy voices, automatic voice pathology detection systems can be built using machine learning (ML) and deep learning (DL) techniques. To fully exploit such systems, large quantities of training data are typically required. The amount of training data is, however, small in the area of pathological voice, and therefore data augmentation (DA) becomes a potential technology to artificially increase the quantity of training data. This study presents a systematic comparison between various DA methods in the detection of pathological voice, including three time domain methods (noise addition, pitch shifting and time stretching), one time-frequency domain method (SpecAugment), and two vocoder-based methods (harmonic-to-noise ratio (HNR) modification and glottal pulse length modification). Detection systems were built using four popular spectral feature representations (static mel-frequency cepstral coefficients (MFCCs), dynamic MFCCs, spectrogram and mel-spectrogram). As classifiers, two widely used ML models (support vector machine (SVM) and random forest (RF)) and two DL models (long short-term memory (LSTM) network and convolutional neural network (CNN) with 1-dimensional (1-D) and 2-dimensional (2-D) architectures) were used. These systems were trained using a small number of training samples from two popular databases of pathological voice (HUPA and SVD) to find the best feature/classifier combination for each database. As a result, one ML-based detection system (mel-spectrogram/SVM for HUPA and SVD) and two DL-based detection systems (dynamic MFCCs/2-D CNN for HUPA and mel-spectrogram/2-D CNN for SVD) were selected for the comparison of the DA methods. The results show that by using DA in the system training, detection accuracy increased compared to the baseline systems that were trained without using DA. This improvement in accuracy was, however, clearly larger for the 2D-CNN system than for the SVM system. Furthermore, all six DA methods improved accuracy of the 2-D CNN system compared to the baseline system for both databases. The highest improvements were achieved using the time-frequency domain SpecAugment DA method, which improved accuracy by 1.5% and 3.8% (absolute) for the HUPA and SVD database, respectively.
科研通智能强力驱动
Strongly Powered by AbleSci AI