Authors
Kun Wei,Bei Li,Hang Lv,Quan Lü,Ning Jiang,Lei Xie
Identifier
DOI: 10.1109/taslp.2024.3389630
Abstract
Automatic Speech Recognition (ASR) in conversational settings presents unique challenges, including extracting relevant contextual information from previous conversational turns. Due to irrelevant content, error propagation, and redundancy, existing methods struggle to extract longer and more effective contexts. To address this issue, we introduce a novel conversational ASR system, extending the Conformer encoder-decoder model with cross-modal conversational representation. Our approach leverages a cross-modal extractor that combines pre-trained speech and text models through a specialized encoder and a modal-level mask input. This enables the extraction of richer historical speech context without explicit error propagation. We also incorporate conditional latent variational modules to learn conversational-level attributes such as role preference and topic coherence. By introducing both cross-modal and conversational representations into the decoder, our model retains longer context without information loss, achieving relative accuracy improvements of 8.8% and 23% on Mandarin conversation datasets HKUST and MagicData-RAMC, respectively, compared to the standard Conformer model.
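The "modal-level mask input" mentioned above can be illustrated with a minimal sketch: during training, one entire modality (speech or text history embeddings) is occasionally zeroed out, so the cross-modal extractor must recover conversational context from the remaining modality. This is a hypothetical illustration of the idea, not the paper's implementation; the function name, masking probability, and concatenation strategy are assumptions.

```python
import numpy as np

def modal_level_mask(speech_emb, text_emb, p_mask=0.3, rng=None):
    """Sketch of a modal-level mask input (illustrative, not the paper's code).

    With probability p_mask, zero out one whole modality so the extractor
    cannot rely on both speech and text history at once.
    """
    rng = rng or np.random.default_rng()
    speech_emb = speech_emb.copy()
    text_emb = text_emb.copy()
    if rng.random() < p_mask:
        if rng.random() < 0.5:
            speech_emb[:] = 0.0   # mask the entire speech modality
        else:
            text_emb[:] = 0.0     # mask the entire text modality
    # Concatenate along the feature axis as input to the cross-modal encoder.
    return np.concatenate([speech_emb, text_emb], axis=-1)
```

Masking a whole modality (rather than individual frames or tokens) pushes the model to align the two representation spaces, which is what lets it exploit historical speech context without propagating explicit transcription errors.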