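Computer science
Pooling
Artificial neural network
Speech recognition
Time delay neural network
Channel (broadcasting)
Pattern recognition (psychology)
Frame (networking)
Context (archaeology)
Artificial intelligence
Speaker recognition
Telecommunications
Biology
Paleontology
Authors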
Brecht Desplanques, Jenthe Thienpondt, Kris Demuynck
Identifier
DOI:10.21437/interspeech.2020-2650
Abstract
Current speaker verification techniques rely on a neural network to extract speaker representations. The successful x-vector architecture is a Time Delay Neural Network (TDNN) that applies statistics pooling to project variable-length utterances into fixed-length speaker characterizing embeddings. In this paper, we propose multiple enhancements to this architecture based on recent trends in the related fields of face verification and computer vision. Firstly, the initial frame layers can be restructured into 1-dimensional Res2Net modules with impactful skip connections. Similarly to SE-ResNet, we introduce Squeeze-and-Excitation blocks in these modules to explicitly model channel interdependencies. The SE block expands the temporal context of the frame layer by rescaling the channels according to global properties of the recording. Secondly, neural networks are known to learn hierarchical features, with each layer operating on a different level of complexity. To leverage this complementary information, we aggregate and propagate features of different hierarchical levels. Finally, we improve the statistics pooling module with channel-dependent frame attention. This enables the network to focus on different subsets of frames during each of the channel's statistics estimation. The proposed ECAPA-TDNN architecture significantly outperforms state-of-the-art TDNN based systems on the VoxCeleb test sets and the 2019 VoxCeleb Speaker Recognition Challenge.
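Two of the ideas in the abstract, channel rescaling from global properties of the recording and statistics pooling with channel-dependent frame attention, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the parameter names (`w1`, `b1`, `w2`, `b2`, `w`, `b`, `v`, `k`), the bottleneck size, and the choice of `tanh` as the attention non-linearity are assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def se_rescale(h, w1, b1, w2, b2):
    """Squeeze-and-Excitation style rescaling (illustrative sketch).

    h: frame-level features of shape (C, T), C channels and T frames.
    w1: (R, C), b1: (R,), w2: (C, R), b2: (C,) are hypothetical parameters.
    """
    z = h.mean(axis=1)                          # squeeze: global mean over time, (C,)
    s = np.maximum(w1 @ z + b1, 0.0)            # bottleneck with ReLU, (R,)
    g = 1.0 / (1.0 + np.exp(-(w2 @ s + b2)))    # sigmoid gate per channel, (C,)
    return h * g[:, None]                       # rescale every frame of each channel

def attentive_stats_pooling(h, w, b, v, k, eps=1e-8):
    """Statistics pooling with channel-dependent frame attention (sketch).

    h: (C, T); w: (R, C); b: (R,); v: (C, R); k: (C,) are hypothetical parameters.
    Returns a fixed-length vector of size 2*C, whatever the value of T.
    """
    scores = v @ np.tanh(w @ h + b[:, None]) + k[:, None]  # one score per channel and frame
    alpha = softmax(scores, axis=1)             # each channel gets its own weights over frames
    mu = (alpha * h).sum(axis=1)                # attention-weighted mean per channel
    var = (alpha * h ** 2).sum(axis=1) - mu ** 2
    sigma = np.sqrt(np.clip(var, eps, None))    # attention-weighted standard deviation
    return np.concatenate([mu, sigma])
```

Under these assumptions, utterances with different numbers of frames `T` map to vectors of the same length `2*C`, and setting the attention parameters to zero makes the weights uniform, so the output reduces to the plain per-channel mean and standard deviation of ordinary statistics pooling.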