Computer science
Speech enhancement
Speech recognition
Intelligibility (philosophy)
Mean opinion score
Deep neural network
Speech processing
Active listening
Artificial neural network
Artificial intelligence
Voice activity detection
Noise (video)
Background noise
Noise measurement
Robustness (evolution)
Task analysis
Quality (concept)
Generative grammar
Speech perception
Hidden Markov model
Authors
Mandar Gogate, Kia Dashtipour, Amir Hussain
Source
Journal: IEEE Transactions on Artificial Intelligence
Publisher: Institute of Electrical and Electronics Engineers
Date: 2024-02-15
Pages: 1-10
Citations: 11
Identifiers
DOI: 10.1109/tai.2024.3366141
Abstract
The human auditory cortex contextually integrates audio-visual (AV) cues to better understand speech in a cocktail party situation. Recent studies have shown that AV speech enhancement (SE) models can significantly improve speech quality and intelligibility in low signal-to-noise ratio (SNR < −5 dB) environments compared to audio-only (A-only) SE models. However, despite substantial research in the area of AV SE, the development of real-time processing models that can generalise across various types of visual and acoustic noise remains a formidable technical challenge. This paper introduces a novel framework for low-latency, speaker-independent AV SE. The proposed framework is designed to generalise to the visual and acoustic noise encountered in real-world settings. In particular, a generative adversarial network (GAN) is proposed to address visual speech noise, including poor lighting, in real noisy environments. In addition, a novel real-time AV SE model based on a deep neural network is proposed; it leverages the enhanced visual speech from the GAN to deliver robust SE. The effectiveness of the proposed framework is evaluated on synthetic AV datasets using objective speech quality and intelligibility metrics. Furthermore, subjective listening tests are conducted using real noisy AV corpora. The results demonstrate that the proposed real-time AV SE framework improves the mean opinion score by 20% compared to state-of-the-art SE approaches, including recent DNN-based AV SE models.
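The abstract describes a real-time DNN that fuses GAN-enhanced visual speech with noisy audio to deliver robust SE. As a rough illustration only, the following is a minimal mask-estimation sketch in PyTorch; the architecture, layer sizes, and feature dimensions are assumptions for exposition, not the authors' model.

# Hedged sketch of mask-based audio-visual speech enhancement. The actual
# GAN-based visual front end and DNN architecture in the paper are not
# reproduced here; all layer sizes and feature dimensions are placeholders.
import torch
import torch.nn as nn

class AVMaskEstimator(nn.Module):
    def __init__(self, n_freq=257, n_visual=128, hidden=256):
        super().__init__()
        self.audio_proj = nn.Linear(n_freq, hidden)     # noisy magnitude spectrogram frames
        self.visual_proj = nn.Linear(n_visual, hidden)  # (GAN-enhanced) lip-region embeddings
        self.rnn = nn.GRU(2 * hidden, hidden, batch_first=True)  # unidirectional, frame-by-frame for low latency
        self.mask = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy_mag, visual_emb):
        # noisy_mag: (batch, frames, n_freq); visual_emb: (batch, frames, n_visual)
        fused = torch.cat([self.audio_proj(noisy_mag), self.visual_proj(visual_emb)], dim=-1)
        h, _ = self.rnn(fused)
        return self.mask(h) * noisy_mag  # masked magnitude estimate of the clean speech

model = AVMaskEstimator()
noisy_mag = torch.rand(1, 100, 257)    # 100 STFT frames of a noisy utterance
visual_emb = torch.rand(1, 100, 128)   # visual features upsampled to the audio frame rate
enhanced_mag = model(noisy_mag, visual_emb)
print(enhanced_mag.shape)              # torch.Size([1, 100, 257])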
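The evaluation combines objective speech quality and intelligibility metrics with subjective listening tests (MOS). Below is a minimal sketch of how such objective scores are commonly computed, assuming PESQ and STOI via the third-party pesq and pystoi packages; the paper does not state its exact metrics or toolchain, so treat these choices as assumptions.

# Hedged sketch: objective speech quality (PESQ) and intelligibility (STOI)
# scores for an enhanced signal against its clean reference.
import numpy as np
from pesq import pesq          # pip install pesq
from pystoi import stoi        # pip install pystoi

def evaluate_enhancement(clean, enhanced, fs=16000):
    """Return (PESQ, STOI) for one utterance; inputs are time-aligned 1-D float arrays."""
    # Wideband PESQ expects 16 kHz signals; narrowband ('nb') would be used at 8 kHz.
    pesq_score = pesq(fs, clean, enhanced, 'wb')
    # STOI returns a value in [0, 1]; extended STOI (ESTOI) is also common.
    stoi_score = stoi(clean, enhanced, fs, extended=False)
    return pesq_score, stoi_score

if __name__ == "__main__":
    import soundfile as sf
    # Placeholder file names; in practice these come from the synthetic AV test corpora.
    clean, fs = sf.read("clean_utterance.wav")
    enhanced, _ = sf.read("enhanced_utterance.wav")
    print(evaluate_enhancement(clean, enhanced, fs))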