Artificial intelligence
Computer vision
Segmentation
Computer science
Image segmentation
Object (grammar)
Scale-space segmentation
Pattern recognition (psychology)
Image processing
Image (mathematics)
Authors
Jinyu Yang,Mingqi Gao,Feng Zheng,Xiantong Zhen,Rongrong Ji,Ling Shao,Aleš Leonardis
Identifier
DOI:10.1109/tip.2024.3374130
Abstract
Depth information opens up opportunities for video object segmentation (VOS) to be more accurate and robust in complex scenes. However, RGBD VOS remains largely unexplored due to the high cost of collecting and the time-consuming annotation of RGBD segmentation data. In this work, we first introduce a new benchmark for RGBD VOS, named DepthVOS, which contains 350 videos (over 55k frames) annotated with masks and bounding boxes. We then propose a novel and strong baseline model, the Fused Color-Depth Network (FusedCDNet), which can be trained with only bounding-box supervision and then generate masks given just a bounding box in the first frame. In summary, our model offers three major advantages: a weakly-supervised training strategy that overcomes high-cost labeling, a cross-modal fusion module that handles complex scenes, and weakly-supervised prediction that promotes ease of use. Extensive experiments demonstrate that our proposed method performs on par with top fully-supervised algorithms. We will open-source our project at http://github.com/yjybuaa/depthvos/ to facilitate the development of RGBD VOS.
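To illustrate the general idea of cross-modal RGB-depth fusion mentioned in the abstract, the sketch below gates each channel of an RGB feature and a depth feature by a softmax weight over the two modalities. This is a generic, hypothetical illustration of modality fusion, not the actual FusedCDNet module, whose architecture is described only in the full paper.

```python
import math

def fuse_rgbd(rgb_feat, depth_feat):
    """Fuse per-channel RGB and depth features with a softmax gate.

    A minimal, hypothetical sketch of cross-modal fusion: the modality
    whose activation is stronger at a channel receives the larger weight.
    This is NOT the paper's FusedCDNet, only an illustration of the concept.
    """
    fused = []
    for r, d in zip(rgb_feat, depth_feat):
        # Softmax over the two modality activations at this channel.
        wr, wd = math.exp(r), math.exp(d)
        s = wr + wd
        fused.append((wr / s) * r + (wd / s) * d)
    return fused

# When both modalities agree, the fused value matches the input;
# when they differ, the stronger activation dominates.
print(fuse_rgbd([2.0], [2.0]))        # both weights are 0.5
print(fuse_rgbd([1.0, 0.0], [0.0, 1.0]))
```

In a real model the gate would be produced by learned layers over spatial feature maps rather than a per-channel softmax, but the weighting principle is the same.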