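Computer science
Naturalness
Dialogue
Coherence (philosophical gambling strategy)
Speech synthesis
Speech recognition
Representation (politics)
Task (project management)
Focus (optics)
Artificial intelligence
Linguistics
Philosophy
Physics
Management
Optics
Quantum mechanics
Politics
Political science
Law
Economics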
Authors
Kangdi Mei, Zhaoci Liu, Hui-Peng Du, Hengyu Li, Yang Ai, Liping Chen, Zhen-Hua Ling
Identifier
DOI: 10.1109/icassp48485.2024.10448356
Abstract
Conversational speech synthesis aims to synthesize the speech of an individual speaker based on the conversation history. However, most studies in conversational speech synthesis focus only on the synthesis performance of the current speaker's turn and neglect the temporal relationship between the interlocutors' turns. We therefore consider the temporal connection between turns, which is crucial for the naturalness and coherence of conversations. Specifically, this paper formulates a task in which turns do not overlap and only one history turn is considered. To complete this task, we propose an acoustic model that leverages multi-modal information (text and speech) from the previous turn to predict the acoustic features of both the current turn and the inter-turn gap. The model is based on MQTTS and incorporates a global acoustic representation and a BERT-based local semantic representation of the previous turn when predicting the acoustic features of each frame. Experimental results demonstrate that, with the introduction of global acoustic information and local semantic information, our model achieves better performance on both the temporal connection between turns and the quality of the synthetic speech. Audio samples can be found at https://mkd-mkd.github.io/icassp2024.