Keywords
Computer science
Naturalness
Speech recognition
Transfer learning
Artificial intelligence
Intelligibility (philosophy)
Natural language processing
Speaker recognition
Feature (linguistics)
Linguistics
Philosophy
Physics
Epistemology
Quantum mechanics
Authors
Myeonghun Jeong, Minchan Kim, Byoung Jin Choi, Jaesam Yoon, Won Jang, Nam Soo Kim
Identifier
DOI: 10.1109/taslp.2024.3364085
Abstract
Though neural text-to-speech (TTS) models show remarkable performance, they still require a large amount of paired <speech, text> data, which is expensive to collect. This heavy demand for paired data limits TTS models to supporting only a small number of speakers and languages. To address this problem, we introduce a transfer learning framework for multi-lingual, zero-shot multi-speaker, and low-resource TTS. First, we pretrain our model in an unsupervised manner on a multi-lingual, multi-speaker speech-only dataset, leveraging self-supervised speech representations as intermediate linguistic representations. Given this pretrained linguistic information, we then apply supervised learning to the TTS model with a small amount of paired data. The pretrained linguistic representations extracted from the large-scale speech-only dataset facilitate phoneme-to-linguistic-feature matching, which provides good guidance for supervised learning with a limited amount of labeled data. We evaluate the performance of our proposed model on low-resource, multi-lingual, and zero-shot multi-speaker TTS tasks. The experimental results demonstrate that our proposed method outperforms the baseline in terms of naturalness, intelligibility, and speaker similarity.
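The abstract describes a two-stage recipe: (1) unsupervised pretraining on speech-only data, using self-supervised speech representations as intermediate linguistic features, then (2) supervised training on a small paired set that only has to map phonemes onto those pretrained features. Below is a minimal sketch of that structure, assuming PyTorch. The module names (LinguisticDecoder, PhonemeEncoder), dimensions, L1 losses, random placeholder batches, and the pre-upsampled frame-level phoneme sequence (a real system needs duration/alignment modeling) are all illustrative assumptions, not the authors' implementation.

# Two-stage transfer-learning sketch; `ssl_features` stands in for
# frame-level self-supervised speech representations (e.g., from a
# wav2vec-style model). All names and sizes are hypothetical.
import torch
import torch.nn as nn

SSL_DIM, MEL_DIM, N_PHONEMES, HID = 768, 80, 100, 256

class LinguisticDecoder(nn.Module):
    # Stage 1: speech-only pretraining. Learns to reconstruct acoustics
    # (mel frames) from self-supervised "linguistic" representations.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(SSL_DIM, HID), nn.ReLU(), nn.Linear(HID, MEL_DIM))

    def forward(self, ssl_features):          # (B, T, SSL_DIM)
        return self.net(ssl_features)         # (B, T, MEL_DIM)

class PhonemeEncoder(nn.Module):
    # Stage 2: phoneme-to-linguistic-feature matching, trained on a
    # small paired <speech, text> set.
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(N_PHONEMES, HID)
        self.proj = nn.Linear(HID, SSL_DIM)

    def forward(self, phonemes):              # (B, T) frame-level phoneme ids
        return self.proj(self.emb(phonemes))  # (B, T, SSL_DIM)

decoder, encoder = LinguisticDecoder(), PhonemeEncoder()

# Stage 1: unsupervised pretraining on multi-lingual speech-only data.
opt1 = torch.optim.Adam(decoder.parameters(), lr=1e-4)
ssl_features = torch.randn(4, 120, SSL_DIM)   # placeholder batch
mel_target = torch.randn(4, 120, MEL_DIM)     # placeholder mel frames
loss1 = nn.functional.l1_loss(decoder(ssl_features), mel_target)
opt1.zero_grad(); loss1.backward(); opt1.step()

# Stage 2: supervised learning with a small paired dataset. The encoder
# is trained to match the pretrained linguistic representations.
opt2 = torch.optim.Adam(encoder.parameters(), lr=1e-4)
phonemes = torch.randint(0, N_PHONEMES, (4, 120))
loss2 = nn.functional.l1_loss(encoder(phonemes), ssl_features)
opt2.zero_grad(); loss2.backward(); opt2.step()

# Inference: text -> linguistic features -> acoustics via frozen decoder.
with torch.no_grad():
    mel = decoder(encoder(phonemes))

The design point this sketch illustrates is that Stage 1 consumes only untranscribed multi-speaker speech, so the scarce paired data in Stage 2 is spent solely on learning the phoneme-to-linguistic-feature mapping rather than the full text-to-acoustics problem.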