Computer science
Discriminative model
Artificial intelligence
Segmentation
Pixel
Vocabulary
Class (philosophy)
Pattern recognition (psychology)
Synthetic data
Shot (pellet)
Image (mathematics)
Computer vision
Philosophy
Linguistics
Chemistry
Organic chemistry
Authors
Weijia Wu, Yuzhong Zhao, Mike Zheng Shou, Hong Zhou, Chunhua Shen
Identifier
DOI: 10.1109/iccv51070.2023.00117
Abstract
Collecting and annotating images with pixel-wise labels is time-consuming and laborious. In contrast, synthetic data can be freely generated with a generative model (e.g., DALL-E, Stable Diffusion). In this paper, we show that it is possible to automatically obtain accurate semantic masks for synthetic images generated by an off-the-shelf Stable Diffusion model, which uses only text-image pairs during training. Our approach, termed DiffuMask, exploits the potential of the cross-attention map between text and image, making it natural and seamless to extend text-driven image synthesis to semantic mask generation. DiffuMask uses text-guided cross-attention information to localize class/word-specific regions, which is combined with practical techniques to create novel high-resolution, class-discriminative pixel-wise masks. This significantly reduces data collection and annotation costs. Experiments demonstrate that existing segmentation methods trained on DiffuMask's synthetic data can achieve performance competitive with counterparts trained on real data (VOC 2012, Cityscapes). For some classes (e.g., bird), DiffuMask achieves promising performance, close to the state-of-the-art results on real data (within a 3% mIoU gap). Moreover, in the open-vocabulary (zero-shot) segmentation setting, DiffuMask achieves new state-of-the-art results on the unseen classes of VOC 2012. The project website can be found at DiffuMask.
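To make the core mechanism concrete, the sketch below (not the authors' released code) shows how the cross-attention assigned to one class word could be turned into a binary semantic mask. The `attention_to_mask` helper, the tensor shapes, and the stand-in random attention are assumptions for illustration; in practice the maps would be collected from Stable Diffusion's UNet cross-attention layers during sampling, and DiffuMask combines them with additional practical refinement techniques that are omitted here.

```python
import torch
import torch.nn.functional as F

def attention_to_mask(attn: torch.Tensor, token_idx: int,
                      out_size: int = 512, threshold: float = 0.5) -> torch.Tensor:
    """Turn cross-attention for one text token into a binary mask.

    attn: (heads, H*W, num_tokens) attention probabilities from one
    cross-attention layer; token_idx: position of the class word in
    the tokenized prompt. Returns an (out_size, out_size) 0/1 mask.
    """
    heads, hw, _ = attn.shape
    side = int(hw ** 0.5)
    # Average the class token's attention over heads -> (side, side) map.
    amap = attn[:, :, token_idx].mean(dim=0).reshape(side, side)
    # Upsample the low-resolution attention map to image resolution.
    amap = F.interpolate(amap[None, None], size=(out_size, out_size),
                         mode="bilinear", align_corners=False)[0, 0]
    # Min-max normalize to [0, 1], then threshold into a pixel-wise mask.
    amap = (amap - amap.min()) / (amap.max() - amap.min() + 1e-8)
    return (amap > threshold).float()

# Stand-in data: 8 heads, 16x16 spatial positions, 77 text tokens
# (the CLIP text encoder's sequence length in Stable Diffusion).
attn = torch.rand(8, 16 * 16, 77).softmax(dim=-1)
mask = attention_to_mask(attn, token_idx=2)  # e.g., index of the word "bird"
print(mask.shape, mask.mean().item())
```

The fixed threshold here is a simplification; choosing it per class and per image is one of the practical issues the paper's additional techniques address.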