Lightweight Adaptive Deep Learning for Efficient Real-Time Speech Enhancement on Edge Devices

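Computer science; speech enhancement; PESQ; deep learning; artificial intelligence; convolutional neural network; encoder; convolution (computer science); speech recognition; computational complexity theory; Enhanced Data rates for GSM Evolution; bottleneck; computer engineering; edge device; electronic engineering; signal processing; speech processing; channel (broadcasting); noise reduction; encoding (memory); edge enhancement; adaptation (eye); computational model; edge computing; noise (video); autoencoder; artificial neural network; translation (biology); speech coding; real-time computing; decoding methods; pattern recognition (psychology); noise measurement; convolutional code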

Authors

Fazal E. Wahab, Zhongfu Ye, Nasir Saleem, Sami Bourouis, Amir Hussain

Source

Journal: IEEE Transactions on Consumer Electronics [Institute of Electrical and Electronics Engineers]
Date: 2025-08-15 Volume (Issue): 71 (4): 12086-12095 Citations: 4

Identifiers

DOI: 10.1109/tce.2025.3598007

Abstract

Deep learning has significantly advanced speech enhancement (SE) by exploiting hierarchical representations to model complex speech patterns. However, deploying these models on resource-constrained edge devices remains challenging due to computational limitations and real-time processing requirements. Convolutional neural networks (CNNs) face challenges due to frequency translation equivariance, which reduces their sensitivity to frequency-specific features essential for speech-noise separation. Transformer-based SE models are effective at capturing global dependencies but are computationally expensive and less suitable for low-latency edge processing. This study proposes an efficient encoder-decoder architecture optimized for SE on edge devices to address these challenges. The model integrates adaptive frequency-aware gated convolution (AFAGC) in the encoder and a Ginformer-based bottleneck, ensuring robust real-time performance with minimal computational overhead. The encoder incorporates adaptive frequency band positional encoding to mitigate translation equivariance, while gated convolution selectively reweights frequency components to emphasize speech-relevant features. The Ginformer-based bottleneck uses low-rank projections to reduce self-attention complexity and an SRU-based temporal gating to enhance noise adaptation and computational efficiency. Evaluation on the VoiceBank+DEMAND dataset demonstrates that the proposed model outperforms recent SE models, achieving a PESQ of 3.25 and STOI of 95.5%. With only 1.32 million parameters and a real-time factor (RTF) of 0.14, it delivers high-quality speech enhancement suitable for real-time deployment on edge devices.
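The abstract reports a real-time factor (RTF) of 0.14 without defining it. Under the usual definition (processing time divided by the duration of the audio processed), an RTF below 1 means the model runs faster than real time. The sketch below illustrates that arithmetic only. The function names `real_time_factor` and `measure_rtf` and the placeholder `enhance` callable are hypothetical, and the paper's actual measurement setup (hardware, frame size, batching) is not stated in this record.

```python
import time
from typing import Callable, Sequence


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (standard RTF definition)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(
    enhance: Callable[[Sequence[float]], object],
    samples: Sequence[float],
    sample_rate: int,
) -> float:
    """Time one call to a hypothetical `enhance` function and report its RTF.

    `enhance` stands in for any speech enhancement model; it is an
    assumption for illustration, not the architecture from the paper.
    """
    start = time.perf_counter()
    enhance(samples)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)


# Under this definition, an RTF of 0.14 corresponds to roughly
# 1.4 s of compute for every 10 s of audio.
rtf_example = real_time_factor(1.4, 10.0)
```

For a streaming deployment, the same ratio would normally be checked per frame as well as on average, since a single slow frame can break real-time playback even when the overall RTF is well below 1.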
