MEConformer: Highly representative embedding extractor for speaker verification via incorporating selective convolution into deep speaker encoder

Keywords: Computer Science, Convolutional Neural Network, Discriminative, Transformer, Encoder, Embedding, Speaker Recognition, Speech Recognition, Feature Extraction, Word Error Rate, Pattern Recognition, Artificial Intelligence
Authors
Qiuyu Zheng, Zengzhao Chen, Zhifeng Wang, Hai Liu, Mingxing Lin
Source
Journal: Expert Systems With Applications [Elsevier]
Volume 244, Article 123004
Identifier
DOI: 10.1016/j.eswa.2023.123004
Abstract
Transformer models have demonstrated superior performance across various domains, including computer vision, natural language processing, and speech recognition. Their success can be attributed to strong parallelism and high computational speed, enabled primarily by the attention layer. In speaker recognition, state-of-the-art results have been achieved with convolutional neural network (CNN) architectures, particularly speaker embeddings represented by x-vectors and r-vectors. However, existing CNN-based methods tend to focus on local features while overlooking the global dependencies of voiceprint features, resulting in the loss of crucial information. Moreover, noise in audio data cannot be disregarded, as it significantly impacts the extraction of discriminative speaker embeddings. To address these challenges, we propose the Multi-Scale Expand Convolution Transformer (MEConformer), a model that converts variable-length audio into a fixed low-dimensional representation. The MEConformer leverages a CNN framework with expanded receptive fields to capture frame-level features effectively. In addition, we introduce a transformer encoder that incorporates contextual dependencies, enabling the extraction of both frame-level and discourse-level feature representations. Furthermore, we present a multi-scale residual aggregation strategy that facilitates the efficient transmission of voiceprint information across the model. By combining these components, the MEConformer achieves a state-of-the-art Equal Error Rate (EER) of 3.72% on the VoxCeleb1 test set, along with EERs of 5.94% and 3.72% on the VoxCeleb1-H and VoxCeleb1-E sets, respectively. The code for the proposed MEConformer model will be made publicly available at https://codeocean.com/capsule/4563012/tree.
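The abstract reports results as Equal Error Rates (EER): the operating point at which the false-acceptance rate (FAR) equals the false-rejection rate (FRR) as the decision threshold on trial scores is varied. As background, here is a minimal, self-contained sketch of computing EER by sweeping candidate thresholds; the score lists are invented for illustration and are not taken from the paper.

```python
def compute_eer(genuine_scores, impostor_scores):
    """Return (eer, threshold) where FAR and FRR are closest to equal.

    genuine_scores: similarity scores for same-speaker trials.
    impostor_scores: similarity scores for different-speaker trials.
    """
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best_gap, eer, best_thr = float("inf"), 1.0, None
    for thr in thresholds:
        # FRR: fraction of genuine trials scored below the threshold (rejected).
        frr = sum(s < thr for s in genuine_scores) / len(genuine_scores)
        # FAR: fraction of impostor trials scored at/above the threshold (accepted).
        far = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer, best_thr = abs(far - frr), (far + frr) / 2, thr
    return eer, best_thr

# Toy trial scores (hypothetical, for illustration only).
genuine = [0.9, 0.8, 0.75, 0.6, 0.55]
impostor = [0.5, 0.45, 0.4, 0.3, 0.65]
eer, thr = compute_eer(genuine, impostor)
```

In practice EER is computed over large trial lists (e.g. the VoxCeleb1 verification pairs) and is usually interpolated from the ROC curve rather than read off a discrete threshold sweep, but the threshold-sweep form above conveys the definition.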