Modality (human-computer interaction)
Computer science
Speech recognition
Artificial intelligence
Computer vision
Natural language processing
Authors
Cheng Luo, Yiguang Liu, Wenhui Sun, Zhoujian Sun
Identifier
DOI: 10.1109/icassp48485.2024.10446142
Abstract
Visual information is often used as a complementary cue for automatic speech recognition in noisy environments. Most previous studies utilize visual information of target speakers (e.g., lip movements) to improve the recognition performance of audio-visual speech recognition (AVSR) models. However, it remains unclear whether visual information of background sound can benefit automatic speech recognition. Our study addresses this question by constructing a new audio-visual dataset and devising an AVSR model. The new Audio-Visual Natural Scenes (AVNS) dataset consists of 11 types of natural scenes (around 31.3 hours) recorded with professional recording devices, and provides audio and visual signals of common background noises in natural acoustic scenes. The AVSR model was designed based on a representation learning framework called AV-HuBERT, which fuses representations of the audio and visual modalities for automatic speech recognition. In this work, we combined the AVNS dataset (providing background sound) with the LRS3 dataset, the largest AVSR benchmark (providing target speech), to create adverse noise conditions for the AVSR model. The results showed that incorporating visual information synchronized with background noises greatly improved model performance (reducing WER by up to 4.9%) in noisy environments. These findings demonstrate that noise-related visual information can contribute to model performance in automatic speech recognition.
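To make the data-construction step concrete, the sketch below illustrates one common way to create such adverse noise conditions: additively mixing a background recording into a clean utterance at a target signal-to-noise ratio. This is only a hypothetical illustration, not the authors' exact pipeline; the function name `mix_at_snr` and the placeholder waveforms are assumptions.

```python
# Hypothetical sketch: corrupt a clean utterance (e.g., LRS3 speech) with a
# background recording (e.g., an AVNS scene) at a chosen SNR. Not the paper's
# actual procedure, only an illustration of additive noise mixing.
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Additively mix `noise` into `speech` at approximately `snr_db` dB SNR."""
    # Loop or trim the noise so it covers the whole utterance.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    # Scale the noise so the speech-to-noise power ratio matches snr_db.
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    noise = noise * np.sqrt(target_noise_power / noise_power)

    return speech + noise

# Usage example with placeholder waveforms (1-D float arrays, same sample rate).
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000).astype(np.float32)  # placeholder "clean speech"
noise = rng.standard_normal(8000).astype(np.float32)    # placeholder "background scene"
noisy = mix_at_snr(speech, noise, snr_db=0.0)            # 0 dB SNR condition
```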