PAV-SOD: A New Task towards Panoramic Audiovisual Saliency Detection

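Computer science; Artificial intelligence; Computer vision; Segmentation; Benchmark (surveying); Task (project management); Object (grammar); Convolutional neural network; Pattern recognition (psychology); Management; Geodesy; Economics; Geography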

Authors

Yi Zhang, Fang-Yi Chao, Wassim Hamidouche, Olivier Déforges

Source

Journal: ACM Transactions on Multimedia Computing, Communications, and Applications [Association for Computing Machinery]
Date: 2022-09-30 Volume (issue): 19 (3): 1-26 Citations: 3

Links

hal.science, doi.org

Identifiers

DOI: 10.1145/3565267

Abstract

Object-level audiovisual saliency detection in 360° panoramic real-life dynamic scenes is important for exploring and modeling human perception in immersive environments, and for aiding the development of virtual, augmented, and mixed reality applications in fields such as education, social networking, entertainment, and training. To this end, we propose a new task, panoramic audiovisual salient object detection (PAV-SOD), which aims to segment the objects that attract the most human attention in 360° panoramic videos of real-life daily scenes. To support the task, we collect PAVS10K, the first panoramic video dataset for audiovisual salient object detection. It consists of 67 4K-resolution equirectangular videos with per-video labels, including hierarchical scene categories and associated attributes describing specific challenges for PAV-SOD, and 10,465 uniformly sampled video frames with manually annotated object-level and instance-level pixel-wise masks. The coarse-to-fine annotations enable multi-perspective analysis of PAV-SOD modeling. We further systematically benchmark 13 state-of-the-art salient object detection (SOD) and video object segmentation (VOS) methods on PAVS10K. In addition, we propose a new baseline network that exploits both the visual and audio cues of 360° video frames by using a new conditional variational auto-encoder (CVAE). Our CVAE-based audiovisual network, CAV-Net, consists of a spatial-temporal visual segmentation network, a convolutional audio-encoding network, and audiovisual distribution estimation modules. CAV-Net outperforms all competing models and can estimate the aleatoric uncertainties within PAVS10K. From extensive experimental results, we derive several findings about PAV-SOD challenges and insights into PAV-SOD model interpretability. We hope that our work can serve as a starting point for advancing SOD towards immersive media.
