Computer science
Spectrogram
Source separation
Front end and back end
Upsampling
Speech recognition
Artificial intelligence
Authors
Daniel Stoller, Sebastian Ewert, Simon Dixon
Source
Journal: International Symposium/Conference on Music Information Retrieval
Date: 2018-06-08
Pages: 334-340
Citations: 393
Identifier
DOI: 10.5281/zenodo.1492417
Abstract
Models for audio source separation usually operate on the magnitude spectrum, which ignores phase information and makes separation performance dependent on hyper-parameters for the spectral front-end. Therefore, we investigate end-to-end source separation in the time-domain, which allows modelling phase information and avoids fixed spectral transformations. Due to high sampling rates for audio, employing a long temporal input context on the sample level is difficult, but required for high quality separation results because of long-range temporal correlations. In this context, we propose the Wave-U-Net, an adaptation of the U-Net to the one-dimensional time domain, which repeatedly resamples feature maps to compute and combine features at different time scales. We introduce further architectural improvements, including an output layer that enforces source additivity, an upsampling technique and a context-aware prediction framework to reduce output artifacts. Experiments for singing voice separation indicate that our architecture yields a performance comparable to a state-of-the-art spectrogram-based U-Net architecture, given the same data. Finally, we reveal a problem with outliers in the currently used SDR evaluation metrics and suggest reporting rank-based statistics to alleviate this problem.
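To make the architecture described in the abstract concrete, below is a minimal sketch of a Wave-U-Net-style model. It is an illustration, not the authors' implementation: the choice of PyTorch, the class name WaveUNetSketch, and all layer counts and channel sizes are assumptions, and it uses same-padding convolutions rather than the paper's context-aware (padding-free) prediction framework. It does show the three ideas the abstract names: repeated resampling of 1D feature maps with skip connections, linear-interpolation upsampling instead of transposed convolutions, and a difference output layer that enforces source additivity by computing the last source as the mixture minus the other estimates.

```python
# A minimal Wave-U-Net-style sketch (illustrative hyper-parameters only;
# the published model is deeper and uses its own filter counts).
import torch
import torch.nn as nn
import torch.nn.functional as F

class WaveUNetSketch(nn.Module):
    def __init__(self, num_layers=4, base_channels=16, num_sources=2):
        super().__init__()
        self.down_convs = nn.ModuleList()
        ch_in = 1
        for i in range(num_layers):
            ch_out = base_channels * (i + 1)
            self.down_convs.append(nn.Conv1d(ch_in, ch_out, kernel_size=15, padding=7))
            ch_in = ch_out
        self.bottleneck = nn.Conv1d(ch_in, ch_in, kernel_size=15, padding=7)
        self.up_convs = nn.ModuleList()
        for i in reversed(range(num_layers)):
            ch_skip = base_channels * (i + 1)
            self.up_convs.append(nn.Conv1d(ch_in + ch_skip, ch_skip, kernel_size=5, padding=2))
            ch_in = ch_skip
        # Predict num_sources - 1 sources; the last one is derived by subtraction.
        self.out_conv = nn.Conv1d(ch_in + 1, num_sources - 1, kernel_size=1)

    def forward(self, mix):                      # mix: (batch, 1, time)
        skips = []
        x = mix
        for conv in self.down_convs:
            x = F.leaky_relu(conv(x))
            skips.append(x)
            x = x[:, :, ::2]                     # decimate: keep every other sample
        x = F.leaky_relu(self.bottleneck(x))
        for conv in self.up_convs:
            skip = skips.pop()
            # Linear-interpolation upsampling instead of transposed convolutions,
            # one of the abstract's measures against output artifacts.
            x = F.interpolate(x, size=skip.shape[-1], mode='linear', align_corners=True)
            x = F.leaky_relu(conv(torch.cat([x, skip], dim=1)))
        x = torch.cat([x, mix], dim=1)           # reuse the raw input at the output
        first = torch.tanh(self.out_conv(x))     # (batch, num_sources - 1, time)
        last = mix - first.sum(dim=1, keepdim=True)
        return torch.cat([first, last], dim=1)   # estimates sum exactly to the mixture

if __name__ == "__main__":
    model = WaveUNetSketch()
    mixture = torch.randn(1, 1, 16384)           # roughly 1 s of 16 kHz audio
    sources = model(mixture)
    print(sources.shape)                         # torch.Size([1, 2, 16384])
    # Source additivity: the estimates sum back to the mixture by construction.
    print(torch.allclose(sources.sum(dim=1, keepdim=True), mixture, atol=1e-5))
```

The abstract's final point, on outliers in SDR-based evaluation, is also easy to illustrate. The sketch below assumes per-track SDR values are already available (e.g. from an evaluation toolkit; none is computed here) and shows the rank-based reporting the abstract suggests: a single near-silent track with an extremely low SDR can dominate the mean, while the median and the median absolute deviation stay representative.

```python
# Rank-based summary of per-track SDR values (sketch; input values are made up).
import numpy as np

def rank_based_summary(sdr_values):
    sdr = np.asarray(sdr_values, dtype=float)
    sdr = sdr[np.isfinite(sdr)]               # silent segments often yield NaN/-inf
    median = np.median(sdr)
    mad = np.median(np.abs(sdr - median))     # robust spread estimate
    return median, mad

tracks = [5.2, 4.8, 6.1, 5.5, -38.0]          # one outlier track
median, mad = rank_based_summary(tracks)
print(f"mean   = {np.mean(tracks):.2f} dB")   # -3.28 dB, dominated by the outlier
print(f"median = {median:.2f} dB, MAD = {mad:.2f} dB")  # 5.20 dB, 0.40 dB
```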
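Note the design choice in the first sketch: because the last source is defined as the mixture minus the sum of the predicted sources, additivity holds exactly by construction rather than being encouraged by a loss term, which is the behaviour the abstract attributes to the additivity-enforcing output layer.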