Computer science
Audio-visual
Audio mining
Artificial intelligence
Visualization
Spatial analysis
Perception
Speech recognition
Computer vision
Acoustic model
Multimedia
Speech processing
Remote sensing
Biology
Geology
Neuroscience
Authors
Dandan Zhu, Kaiwei Zhang, Nana Zhang, Qiangqiang Zhou, Xiongkuo Min, Guangtao Zhai, Xiaokang Yang
Identifier
DOI: 10.1109/tmm.2023.3271022
Abstract
Spatial audio is a crucial component of omnidirectional videos (ODVs): it provides an immersive experience by enabling viewers to perceive sound sources in all directions. However, most visual attention modeling work for ODVs focuses only on visual cues, and the audio modality is rarely considered. Moreover, existing audio-visual saliency models for ODVs lack spatial audio location awareness (i.e., they are agnostic to sound source location) and cannot discriminate audio content attributes (i.e., they are agnostic to audio content attributes). To this end, we propose a novel audio-visual perception saliency (AVPS) model that is aware of spatial audio location and adaptive to audio content attributes, to efficiently address fixation prediction in ODVs. Specifically, we first use an improved group equivariant convolutional neural network (G-CNN) with an eidetic 3D LSTM (E3D-LSTM) to extract spatial-temporal visual features. We then localize sound sources by computing an audio energy map (AEM) from the audio track of the ODV. Next, we use SoundNet to extract audio features with multiple attributes. Finally, we develop an audio-visual feature fusion module that adaptively integrates the spatial-temporal visual features and spatial auditory information to generate the final audio-visual saliency map. Extensive experiments on three audio modalities validate the effectiveness of the proposed model, which also outperforms 10 state-of-the-art saliency models.
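The abstract describes the audio energy map (AEM) only at a high level, so the sketch below illustrates one common way such a map can be computed: steering a simple first-order ambisonic beam toward every pixel direction of the equirectangular frame and accumulating signal energy. This is a minimal illustrative sketch, not the authors' implementation; the function name compute_aem, the first-order ambisonic (W, X, Y, Z) input format, the cardioid-like steering pattern, and the multiplicative fusion at the end are all assumptions.

```python
import numpy as np

def compute_aem(foa, height=64, width=128, eps=1e-8):
    """Hypothetical audio energy map from first-order ambisonic audio.

    foa: array of shape (4, T) holding the W, X, Y, Z ambisonic channels
    for one frame's audio window. Returns an equirectangular
    (height, width) map of directional audio energy, scaled to [0, 1].
    """
    w, x, y, z = foa
    # Equirectangular grid of candidate directions (longitude, latitude).
    lon = np.linspace(-np.pi, np.pi, width)
    lat = np.linspace(np.pi / 2, -np.pi / 2, height)
    lon, lat = np.meshgrid(lon, lat)
    # Unit look-direction vector for every pixel.
    dx = np.cos(lat) * np.cos(lon)
    dy = np.cos(lat) * np.sin(lon)
    dz = np.sin(lat)
    # Steer a simple cardioid-like beam toward each pixel direction and
    # accumulate energy over the window: s(t) = W + dx*X + dy*Y + dz*Z.
    steered = (w[None, None, :]
               + dx[..., None] * x
               + dy[..., None] * y
               + dz[..., None] * z)
    aem = (steered ** 2).sum(axis=-1)
    aem -= aem.min()
    return aem / (aem.max() + eps)

# Toy usage: modulate a visual saliency map with the AEM so that regions
# aligned with strong sound sources are up-weighted. The additive-gain
# fusion rule here is illustrative, not the paper's fusion module.
foa = np.random.randn(4, 1024)            # stand-in for one audio window
vis_sal = np.random.rand(64, 128)         # stand-in visual saliency map
aem = compute_aem(foa)
audio_visual_sal = vis_sal * (1.0 + aem)  # simple location-aware weighting
```

Under this reading, the AEM acts as a location-aware spatial prior: pixels whose directions carry more acoustic energy receive higher weight when the auditory and visual streams are fused.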