关键词 (Keywords)
Naturalness, Computer science, Conversation, Context, Dependency modeling, Natural language processing, Speech synthesis, Commonsense knowledge, Speech recognition, Context model, Artificial intelligence, Psychology, Knowledge base, Communication
作者 (Authors)
Yayue Deng, Jinlong Xue, Fengping Wang, Yingming Gao, Ya Li
标识 (Identifier)
DOI:10.1145/3581783.3612565
摘要 (Abstract)
Conversational Speech Synthesis (CSS) aims to produce speech appropriate for oral communication. However, the complexity of context-dependency modeling poses significant challenges for CSS, especially the mutual psychological influence between interlocutors. Previous studies have verified that prior commonsense knowledge helps machines understand subtle psychological information (e.g., feelings and intentions) in spontaneous oral dialogues. Therefore, to enhance context understanding and improve the naturalness of synthesized speech, we propose a novel conversational speech synthesis system (CMCU-CSS) that incorporates a Commonsense-based Multi-modal Context Understanding (CMCU) module to model the dynamic emotional interaction among interlocutors. Specifically, we first use three implicit states (the intent state, internal state, and external state) in CMCU to model inter- and intra-speaker context dependency with the help of commonsense knowledge. We then infer emotion vectors from the fusion of these implicit states and multi-modal features to enhance the emotion discriminability of the synthesized speech. This is the first attempt to combine commonsense knowledge with conversational speech synthesis, and its effect on the emotion discriminability of synthetic speech is evaluated via an emotion recognition in conversation task. Subjective and objective evaluations demonstrate that CMCU-CSS produces more natural speech with context-appropriate emotion and achieves the best emotion discriminability among the compared conversational speech synthesis models.
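The abstract describes inferring emotion vectors by fusing the three implicit states with multi-modal features. The paper does not give the fusion operator here, so the sketch below is a minimal, hypothetical illustration only: it assumes simple concatenation followed by a linear projection, and all names, dimensions, and weights are invented for demonstration.

```python
import numpy as np

def fuse_to_emotion(intent_state, internal_state, external_state,
                    multimodal_feat, proj):
    """Hypothetical fusion sketch: concatenate the three implicit
    states with the multi-modal features, then linearly project the
    result to an emotion vector. Not the paper's actual operator."""
    fused = np.concatenate([intent_state, internal_state,
                            external_state, multimodal_feat])
    return proj @ fused

rng = np.random.default_rng(0)
d = 8        # assumed per-state feature dimensionality
emo_dim = 4  # assumed emotion-vector dimensionality
proj = rng.standard_normal((emo_dim, 4 * d))  # toy projection weights

emotion_vec = fuse_to_emotion(rng.standard_normal(d),
                              rng.standard_normal(d),
                              rng.standard_normal(d),
                              rng.standard_normal(d),
                              proj)
print(emotion_vec.shape)  # prints (4,)
```

In a trained system the projection would be learned jointly with the synthesizer; the point of the sketch is only the data flow, i.e. that one vector per utterance is derived from the joint evidence of all three implicit states plus the multi-modal context.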