Enhancing Deepfake Audio Detection: A ResNet Framework Based on Hybrid Features and Self‐Attention Mechanism

Computer science; Mechanism (biology); Human-computer interaction; Epistemology; Philosophy

Authors

Lian Huang, Jixiang Yang, Jinhong Zhao, Lian Huang

Source

Journal: Expert Systems [Wiley]
Date: 2025-05-07  Volume (issue): 42 (6)

Abstract

Deep learning has driven significant progress in audio spoofing detection, and many countermeasures now reliably detect spoofed audio produced by speech synthesis or voice conversion. However, automatic speaker verification (ASV) systems remain vulnerable to spoofing attacks such as replay and deepfake audio. Deepfake audio, generated using text‐to‐speech (TTS) and voice conversion (VC) algorithms, poses a particularly significant challenge. To address this vulnerability, we propose a novel framework incorporating hybrid features and a self‐attention mechanism for enhanced spoofing detection. Our approach makes the following key contributions: (1) a novel dual‐path feature extraction architecture that runs convolutional neural networks (CNNs) in parallel with a Short‐Time Fourier Transform (STFT) followed by Mel‐frequency filtering, capturing complementary deep learning features and Mel‐spectrogram features, respectively; (2) a max‐pooling‐based feature fusion strategy that concatenates the extracted features to preserve crucial discriminative information; (3) a self‐attention mechanism that dynamically weights and focuses on salient temporal‐spectral patterns within the fused feature representation; and (4) a ResNet‐based classifier, augmented with linear layers, for robust spoofing classification. Evaluation on the ASVspoof 2021 dataset demonstrates the efficacy of the proposed framework. We achieve state‐of‐the‐art performance, attaining an Equal Error Rate (EER) of 9.67% in the physical access (PA) scenario and 8.94% in the deepfake task. These results correspond to relative improvements of 74.60% and 60.05%, respectively, over the best‐performing baseline systems. These findings underscore the discriminative power of our hybrid feature approach and its ability to capture richer utterance details than conventional single‐modality feature representations. This work offers a promising new direction for developing robust ASV systems resilient to increasingly sophisticated spoofing attacks.
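
The abstract reports results as Equal Error Rate and as relative improvement over a baseline. The following is a minimal Python sketch of how both quantities are commonly computed; it is not the authors' evaluation code. The function names, the toy scores, and the convention that a higher score means "more likely bona fide" are assumptions made for illustration, and relative improvement is assumed to mean (baseline EER - proposed EER) / baseline EER.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Return (eer, threshold) at the point where the two error rates are closest.

    Assumed convention: a higher score means "more likely bona fide".
    """
    best = None
    # Every observed score is tried as a candidate decision threshold.
    for t in sorted(set(bonafide_scores) | set(spoof_scores)):
        # False acceptance: spoofed trials scoring at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection: bona fide trials scoring below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


def relative_improvement(baseline_eer, proposed_eer):
    """Fraction by which the proposed EER undercuts the baseline EER."""
    return (baseline_eer - proposed_eer) / baseline_eer


def implied_baseline(proposed_eer, rel_improvement):
    """Invert relative_improvement to recover the baseline EER it implies."""
    return proposed_eer / (1.0 - rel_improvement)


# Toy scores (hypothetical, not taken from ASVspoof 2021).
eer, threshold = equal_error_rate([0.9, 0.8, 0.7, 0.35], [0.1, 0.2, 0.3, 0.6])
print(eer, threshold)  # 0.25 0.6

# Baseline EERs (in percent) implied by the figures quoted in the abstract.
print(round(implied_baseline(9.67, 0.7460), 2))  # 38.07 (PA)
print(round(implied_baseline(8.94, 0.6005), 2))  # 22.38 (deepfake)
```

Under that assumed definition, the quoted figures imply baseline EERs of roughly 38.07% (PA) and 22.38% (deepfake). These values are derived arithmetically from the abstract and are not quoted from the paper.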
