Computer science
Transcription (linguistics)
Encoder
Speech recognition
Linguistics
Operating system
Philosophy
Authors
Yan Huang, Piyush Behre, Guoli Ye, Shawn Chang, Yifan Gong
Identifier
DOI:10.1109/asru57964.2023.10389653
Abstract
Human professional transcription services offer a variety of transcription styles to suit different needs. To accommodate different users and facilitate seamless integration with downstream applications, we propose a framework for generating multi-style transcription in an attention-based encoder-decoder (AED) model using three different architectures: (A) style-dependent layers; (B) mixed-style output; (C) style-dependent prompt. In this framework, both the verbatim lexical transcription and readable transcriptions of various styles can be generated simultaneously or separately, through a single decoding pass or multiple on-demand decoding passes. We conduct experiments in a large-scale AED-based speech transcription system trained on 50k hours of speech. The proposed framework achieves nearly on-par performance compared to the single-style AED, with significant savings in model footprint and decoding cost. Moreover, it provides an efficient data-sharing mechanism across different styles through knowledge transfer.
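To illustrate the idea behind architecture (C), a style-dependent prompt, here is a minimal sketch in which the requested style is injected as the first decoder token, so a single model can produce either a verbatim or a readable transcription. The token names, the `decode_with_style` helper, and the toy decoder step are all hypothetical stand-ins, not the paper's actual implementation:

```python
# Hypothetical style prompt tokens; the paper does not specify the tag inventory.
STYLE_TOKENS = {"verbatim": "<verbatim>", "readable": "<readable>"}

def decode_with_style(step_fn, style, max_len=10, eos="</s>"):
    """Greedy decoding where the first decoder input is a style prompt token.

    step_fn(prefix) -> next token; stands in for one AED decoder step.
    """
    prefix = [STYLE_TOKENS[style]]  # the style prompt conditions all later steps
    for _ in range(max_len):
        tok = step_fn(prefix)
        if tok == eos:
            break
        prefix.append(tok)
    return prefix[1:]               # drop the prompt token from the output

# Toy "decoder": emits a different surface form depending on the style prompt,
# mimicking verbatim (with disfluencies) vs. readable (punctuated) output.
def toy_step(prefix):
    verbatim = ["uh", "hello", "world", "</s>"]
    readable = ["Hello", "world", ".", "</s>"]
    seq = verbatim if prefix[0] == "<verbatim>" else readable
    return seq[len(prefix) - 1]

print(decode_with_style(toy_step, "verbatim"))  # ['uh', 'hello', 'world']
print(decode_with_style(toy_step, "readable"))  # ['Hello', 'world', '.']
```

Because only the prompt token changes between passes, the same parameters serve every style, which is what makes the on-demand multiple-pass decoding described in the abstract cheap in model footprint.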