Segmentation
Computer science
Artificial intelligence
Feature (linguistics)
Stage (stratigraphy)
Pattern recognition (psychology)
Semantic feature
Linguistics
Geology
Philosophy
Paleontology
Authors
Yingdong Ma, Xiaobin Lan
Identifier
DOI:10.1016/j.imavis.2024.104996
Abstract
Recently, vision transformers (ViTs) have demonstrated strong performance on various computer vision tasks. The success of ViTs can be attributed to their ability to capture long-range dependencies. However, transformer-based approaches often yield segmentation maps with incomplete object structures because of restricted cross-stage information propagation and a lack of low-level details. To address these problems, we introduce a CNN-transformer semantic segmentation architecture that adopts a CNN backbone for multi-level feature extraction and a transformer encoder that focuses on global perception learning. Transformer embeddings of all stages are integrated to compute feature weights for dynamic cross-stage feature reweighting. As a result, high-level semantic context and low-level spatial details can be embedded into each stage to preserve multi-level information. An efficient attention-based feature fusion mechanism combines the reweighted transformer embeddings with CNN features to generate segmentation maps with more complete object structures. Unlike regular self-attention, which has quadratic computational complexity, our efficient self-attention method achieves similar performance with linear complexity. Experimental results on the ADE20K and Cityscapes datasets show that the proposed approach outperforms most state-of-the-art networks.
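The abstract's exact efficient self-attention formulation is not specified, but the general idea behind linear-complexity attention can be sketched as follows: standard attention computes softmax(QKᵀ)V, which is O(N²) in the number of tokens N, whereas kernelized linear attention computes φ(Q)(φ(K)ᵀV), which is O(N) because φ(K)ᵀV is a small d×d summary matrix. This is a minimal illustrative sketch, assuming the elu(x)+1 feature map from the linear-attention literature; it is not the paper's actual method.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Kernelized attention with feature map phi(x) = elu(x) + 1.

    Cost is O(N * d^2) instead of the O(N^2 * d) of softmax attention,
    because K and V are reduced to a (d, d) summary before touching Q.
    """
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1, strictly positive
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                                  # (d, d) summary, independent of N
    z = Qp @ Kp.sum(axis=0, keepdims=True).T       # (N, 1) per-row normalizer
    return (Qp @ kv) / (z + eps)

rng = np.random.default_rng(0)
N, d = 8, 4  # toy sizes: 8 tokens, 4 channels
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
out = linear_attention(Q, K, V)
print(out.shape)  # (8, 4): one attended vector per token
```

Because the positive feature map makes every attention weight nonnegative and the normalizer makes each row's weights sum to one, the output is a convex combination of the value vectors, mirroring softmax attention's behavior at linear cost.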