Computer science
Spectrogram
Artificial intelligence
Classifier (UML)
Pattern recognition (psychology)
Audio signal processing
Speech recognition
Frame (networking)
Computer vision
Audio signal
Speech coding
Telecommunications
Authors
Alireza Nasiri,Yuxin Cui,Zhonghao Liu,Jing Jin,Yong Zhao,Jianjun Hu
Identifier
DOI: 10.1109/ictai.2019.00074
Abstract
Deep learning methods have recently made significant contributions to sound event detection. These methods either use a block-level approach to distinguish the parts of audio containing the event, or analyze small frames of the audio separately. In this paper, we introduce a new method, AudioMask, for rare sound event detection that combines these two approaches. AudioMask first applies Mask R-CNN, a state-of-the-art algorithm for detecting objects in images, to the log mel-spectrograms of the audio files. Mask R-CNN detects audio segments that might contain the target event by generating bounding boxes around them in the time-frequency domain. We then use a frame-based audio event classifier, trained independently from Mask R-CNN, to analyze each individual frame in the candidate segments proposed by Mask R-CNN. A post-processing step combines the outputs of Mask R-CNN and the frame-level classifier to identify the true events. By evaluating AudioMask on the data sets from the 2017 Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge Task 2, we show that our algorithm outperforms the baseline models by 13.3% in average F-score and achieves better results than the other non-ensemble methods in the challenge.
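The sketch below is a minimal illustration of the pipeline the abstract describes, not the authors' released code. It assumes librosa for the log mel-spectrogram, and it assumes the Mask R-CNN proposals have already been projected onto the time axis as (start_frame, end_frame) pairs; the function names, the 0.5 threshold, and the spectrogram parameters are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
import librosa


def log_mel_spectrogram(path, sr=44100, n_fft=2048, hop_length=512, n_mels=128):
    """Load an audio file and compute its log mel-spectrogram (n_mels x n_frames)."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)


def combine_proposals_and_frame_scores(boxes, frame_probs, threshold=0.5):
    """Hypothetical post-processing: keep frames that lie inside a proposal
    *and* are scored above `threshold` by the frame-level classifier, then
    merge consecutive positive frames into event intervals.

    boxes       -- list of (start_frame, end_frame) proposals on the time axis
    frame_probs -- per-frame event probabilities, shape (n_frames,)
    """
    positive = np.zeros(len(frame_probs), dtype=bool)
    for start, end in boxes:
        seg = slice(start, end)
        positive[seg] = frame_probs[seg] > threshold

    events, in_event, onset = [], False, 0
    for t, flag in enumerate(positive):
        if flag and not in_event:
            onset, in_event = t, True
        elif not flag and in_event:
            events.append((onset, t))
            in_event = False
    if in_event:
        events.append((onset, len(positive)))
    return events
```

Under these assumptions, frame indices can be converted back to seconds by multiplying by hop_length / sr, which is how detected intervals would typically be reported for evaluation.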