Keywords
computer science; speech recognition; speech synthesis; spectrogram; artificial neural network; voice activity detection; speaker recognition; metric (unit); speech processing; mean opinion score; embedding; speech corpus; artificial intelligence
Authors
Daigang Chen, Hua Jiang, Chengxi Pu, Shaowen Yao
Abstract
In recent years, speech synthesis based on machine learning has become increasingly popular, and many neural network models can now generate synthetic audio that closely imitates the human voice. The quality of this generated audio is usually evaluated by the mean opinion score (MOS). The voiceprint is an important characteristic for distinguishing a speaker's vocal features, and generating speech with specific voiceprint features is of great significance for broadening the applications of speech synthesis. However, existing speech synthesis models seldom consider preserving specific voiceprint features. In this paper, we propose D-MelGAN, a speech synthesis model that generates high-quality speech carrying a specific speaker's voiceprint features. The model is based on the non-autoregressive, feed-forward convolutional neural network architecture of GANs. By embedding the d-vector technique, originally used to identify specific voiceprints, into the GAN, the model generates raw audio waveforms with the voiceprint characteristics of a specific speaker. The experimental results show that the new model strengthens the voiceprint features of the generated audio while maintaining the quality of the synthesized speech. This gives the generated speech the style of a specific speaker, allowing text-to-speech technology to be applied in more fields.
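The abstract describes conditioning a MelGAN-style generator on a d-vector speaker embedding. A minimal NumPy sketch of one plausible conditioning step is shown below: the d-vector is tiled across time and concatenated to every mel-spectrogram frame before it is fed to the generator. The function name and all dimensions here are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def condition_on_dvector(mel, d_vector):
    """Tile a speaker d-vector across time and concatenate it to each
    mel-spectrogram frame, producing speaker-conditioned generator input.

    mel      : (n_mels, T) mel-spectrogram
    d_vector : (d,) fixed-length speaker embedding
    returns  : (n_mels + d, T) conditioned features
    """
    _, T = mel.shape
    tiled = np.repeat(d_vector[:, None], T, axis=1)   # (d, T)
    return np.concatenate([mel, tiled], axis=0)        # (n_mels + d, T)

# Illustrative shapes: 80 mel bands, 100 frames, 256-dim d-vector.
mel = np.random.randn(80, 100)
d_vec = np.random.randn(256)
cond = condition_on_dvector(mel, d_vec)
print(cond.shape)  # (336, 100)
```

The concatenation keeps the temporal resolution of the mel-spectrogram unchanged while making the speaker identity available to the generator at every frame.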