A Simple Yet Effective Network Based on Vision Transformer for Camouflaged Object and Salient Object Detection

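Artificial intelligence, Computer vision, Object detection, Computer science, Object (grammar), Transformer, Salience, Pattern recognition (psychology), Cognitive neuroscience of visual object recognition, Image processing, Image (mathematics), Engineering, Voltage, Electrical engineering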

Authors

Chao Hao,Zitong Yu,Xin Liu,Jun Xu,Huanjing Yue,Jingyu Yang

Source

Journal: IEEE Transactions on Image Processing [Institute of Electrical and Electronics Engineers]
Date: 2025-01-01 Volume: 34, pp. 608-622 Cited by: 34

Identifier

DOI: 10.1109/tip.2025.3528347

Abstract

Camouflaged object detection (COD) and salient object detection (SOD) are two distinct yet closely related computer vision tasks that have been widely studied over the past decades. Although both aim to segment an image into binary foreground and background regions, COD focuses on concealed objects hidden in the image, while SOD concentrates on the most prominent objects in the image. Building universal segmentation models is currently a hot topic in the community. Previous works achieved good performance on a specific task by stacking various hand-designed modules and multi-scale features. However, these careful task-specific designs also cost them their potential as general-purpose architectures. Therefore, we hope to build general architectures that can be applied to both tasks. In this work, we propose a simple yet effective network (SENet) based on the vision Transformer (ViT). By employing a simple asymmetric ViT-based encoder-decoder structure, we obtain competitive results on both tasks, exhibiting greater versatility than meticulously crafted models. To enhance the performance of universal architectures on both tasks, we propose general methods that target difficulties common to the two tasks. First, we use image reconstruction as an auxiliary task during training to increase the difficulty of training, forcing the network to have a better perception of the image as a whole to help with segmentation tasks. In addition, we propose a local information capture module (LICM) to make up for the limitations of the patch-level attention mechanism in pixel-level COD and SOD tasks, and a dynamic weighted loss (DW loss) to address the problem that small target samples are more difficult to locate and segment in both tasks. Finally, we also conduct a preliminary exploration of joint training, trying to use one model to complete the two tasks simultaneously.
Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of our method. The code is available at https://github.com/linuxsino/SENet.
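
The abstract does not give the formula for the DW loss, so the following is only an illustrative sketch of the general idea it describes: giving samples with small targets more weight during training. The weighting rule, the function names (`foreground_ratio`, `size_aware_weight`, `weighted_bce`) and the parameter `alpha` are assumptions made for this example and are not taken from the paper or from the SENet repository.

```python
import math


def foreground_ratio(mask):
    """Fraction of pixels labelled foreground (1) in a binary mask."""
    pixels = [p for row in mask for p in row]
    return sum(pixels) / len(pixels)


def size_aware_weight(mask, alpha=1.0):
    """Hypothetical weight: rises from 1 toward 1 + alpha as the target shrinks."""
    return 1.0 + alpha * (1.0 - foreground_ratio(mask))


def weighted_bce(pred, mask, alpha=1.0, eps=1e-7):
    """Mean binary cross-entropy over all pixels, scaled by the size-aware weight."""
    total, count = 0.0, 0
    for pred_row, mask_row in zip(pred, mask):
        for p, y in zip(pred_row, mask_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
            count += 1
    return size_aware_weight(mask, alpha) * total / count


# Two toy 4x4 ground-truth masks: one small target, one large target.
small_mask = [[0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
large_mask = [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 0, 0], [1, 1, 0, 0]]

# An uninformative prediction (0.5 everywhere) for both samples.
uniform_pred = [[0.5] * 4 for _ in range(4)]

loss_small = weighted_bce(uniform_pred, small_mask)
loss_large = weighted_bce(uniform_pred, large_mask)
```

With the same uninformative prediction, the small-target sample receives the larger loss, so under this assumed rule it contributes more to the gradient than the large-target sample.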
