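Pronunciation
Discriminative
Computer science
Speech recognition
Artificial intelligence
Connectionism
Encoder
Pattern recognition (psychology)
Task (project management)
Artificial neural network
Natural language processing
Linguistics
Operating system
Philosophy
Economics
Management
Authors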
Binghuai Lin,Liyuan Wang
Identifier
DOI:10.1109/slt54892.2023.10022486
Abstract
This paper proposes an end-to-end pronunciation assessment method that exploits abundant native data and reduces the need for non-native data, which is costly to label. To obtain discriminative acoustic representations at the phoneme level, the pretrained wav2vec 2.0 model is re-trained with connectionist temporal classification (CTC) loss for phoneme recognition using native data. These acoustic representations are fused with phoneme representations derived from a phoneme encoder to obtain the final pronunciation scores. An efficient fusion mechanism uses attention to align each phoneme with acoustic frames, masking all frames that the CTC-based phoneme recognizer labels as blank. Finally, the whole network is optimized in a multi-task learning framework that combines the CTC loss with the mean squared error loss between predicted and human scores. Extensive experiments demonstrate that the method outperforms previous baselines in Pearson correlation coefficient, even with far less labeled non-native data.
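The fusion and training objective described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the scaled dot-product form of the attention, the function names `masked_attention_fusion` and `multitask_loss`, and the `ctc_weight` interpolation are all assumptions made for illustration, since the abstract only states that blank frames are masked and that the two losses are combined. The CTC loss is taken as a precomputed number, and each utterance is assumed to contain at least one non-blank frame.

```python
import math


def masked_attention_fusion(phoneme_vecs, frame_vecs, blank_mask):
    """Attend from each phoneme vector over the non-blank acoustic frames.

    phoneme_vecs: list of phoneme embeddings (queries), each of length dim
    frame_vecs:   list of acoustic frame embeddings, each of length dim
    blank_mask:   list of bools, True where the CTC recognizer output blank
    Assumes at least one frame is non-blank (hypothetical simplification).
    """
    dim = len(frame_vecs[0])
    fused = []
    for query in phoneme_vecs:
        # Scaled dot-product score per frame; blank frames get -inf so that
        # the softmax assigns them exactly zero weight.
        scores = [
            -math.inf if is_blank
            else sum(q * f for q, f in zip(query, frame)) / math.sqrt(dim)
            for frame, is_blank in zip(frame_vecs, blank_mask)
        ]
        peak = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of frame vectors gives one acoustic vector per phoneme.
        fused.append([
            sum(w * frame[d] for w, frame in zip(weights, frame_vecs))
            for d in range(dim)
        ])
    return fused


def multitask_loss(ctc_loss, predicted_scores, human_scores, ctc_weight=0.5):
    """Combine a precomputed CTC loss with the MSE between score lists.

    The linear interpolation with ctc_weight is an assumed weighting scheme.
    """
    mse = sum(
        (p - h) ** 2 for p, h in zip(predicted_scores, human_scores)
    ) / len(human_scores)
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * mse
```

With a single non-blank frame, the fused vector for every phoneme is that frame itself, which makes the effect of the mask easy to check by hand. In a real system the frame vectors would come from the re-trained wav2vec 2.0 encoder and the blank mask from its CTC output.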