Authors
Shi Sheng, Jiahao Shao, Hong Hao, Yangzhou Du, Jianping Fan
Identifier
DOI: 10.1109/icassp43922.2022.9746992
Abstract
Non-parallel voice conversion (VC) is a technique for transferring voice from one style to another without using a parallel corpus in model training. Various methods have been proposed to approach non-parallel VC using deep neural networks. Among them, CycleGAN-VC and its variants have been widely accepted as benchmark methods. However, a gap remains between the real target voice and the converted voice, and the increased number of parameters leads to slow convergence during training. Inspired by recent advances in unsupervised image translation, we propose U-GAT-VC, a new end-to-end unsupervised framework that adopts a novel inter- and intra-attention mechanism to guide the voice conversion toward the more important regions of the spectrogram. We also introduce a disentangled perceptual loss into our model to capture high-level spectral features. Subjective and objective evaluations show that our proposed model outperforms CycleGAN-VC2/3 in terms of conversion quality and voice naturalness.
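The abstract does not spell out the attention mechanism, but U-GAT-VC's image-translation inspiration (U-GAT-IT-style models) suggests class-activation-map-style attention: an auxiliary classifier's weights score each channel of the encoder's feature maps, and the resulting map reweights the spectrogram features so the generator focuses on salient time-frequency regions. The sketch below is a minimal NumPy illustration of that general idea, not the paper's actual architecture; the function name, shapes, and the use of a single weight vector `w` are illustrative assumptions.

```python
import numpy as np

def cam_attention(feat, w):
    """CAM-style attention over spectrogram feature maps (illustrative sketch).

    feat: (C, F, T) encoder feature maps (channels, mel bins, frames).
    w:    (C,) weights of a hypothetical auxiliary classifier applied to
          globally pooled features.
    Returns the attended features and the (F, T) attention map.
    """
    # Weight each channel by its classifier weight and sum over channels.
    attn = np.tensordot(w, feat, axes=([0], [0]))   # -> (F, T)
    attn = 1.0 / (1.0 + np.exp(-attn))              # sigmoid, squash to [0, 1]
    return feat * attn[None, :, :], attn

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 80, 100))  # e.g. 8 channels, 80 mel bins, 100 frames
w = rng.standard_normal(8)
out, attn = cam_attention(feat, w)
```

In a trained model `w` would be learned jointly with the discriminator's auxiliary classifier, so high-attention regions correspond to the time-frequency areas most useful for distinguishing source from target speakers.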