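Computer science
Artificial intelligence
Segmentation
Image segmentation
Feature (linguistics)
Pattern recognition (psychology)
Feature extraction
Computer vision
Scale-space segmentation
Semantics (computer science)
Natural language processing
Linguistics
Philosophy
Programming language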
Authors
Quan Tang,Chuanjian Liu,Fagui Liu,Jun Jiang,Bowen Zhang,C. L. Philip Chen,Kai Han,Yunhe Wang
Identifier
DOI:10.1109/tip.2025.3534532
Abstract
The encoder-decoder architecture is a prevailing paradigm for semantic segmentation, and aggregating multi-stage encoder features has been found to play a significant role in capturing discriminative pixel representations. In this work, we rethink feature reconstruction for scale alignment of multi-stage pyramidal features and treat it as a Query Update (Q-UP) task. Pixel-wise affinity scores are calculated between the high-resolution query map and the low-resolution feature map to dynamically broadcast low-resolution pixel features to a higher resolution. Unlike prior methods (e.g., bilinear interpolation) that only exploit sub-pixel neighborhoods, Q-UP samples contextual information within a global receptive field in a data-dependent manner. To alleviate intra-category feature variance, we replace the source pixel features used for reconstruction with their corresponding category prototypes, each computed by averaging all pixel features belonging to that category. In addition, a memory module is proposed to explore the capacity of category prototypes at the dataset level. We refer to the method as Category Prototype Transformer (CPT). We conduct extensive experiments on popular benchmarks. Integrating CPT into a feature pyramid structure yields superior semantic segmentation performance even with low-resolution feature maps (e.g., 1/32 of the input size), significantly reducing computational complexity. Specifically, the proposed method obtains 55.5% mIoU on the challenging ADE20K dataset with greatly reduced model parameters and computations.
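The two ideas in the abstract, affinity-based upsampling and category prototypes, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract only says that pixel-wise affinity scores are computed between the query map and the low-resolution features, so the scaled dot product with a softmax used here is an assumption borrowed from standard attention. The function names (`query_update`, `category_prototypes`), the flattened `(pixels, channels)` layout, and the way low-resolution pixels receive category labels are likewise hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def query_update(query_hr, key_lr, value_lr):
    """Reconstruct high-resolution features from low-resolution ones.

    query_hr: (Nq, C) flattened high-resolution query map
    key_lr:   (Nk, C) flattened low-resolution feature map
    value_lr: (Nk, C) features broadcast to the higher resolution
    Assumption: affinity = softmax of scaled dot products.
    """
    scale = 1.0 / np.sqrt(query_hr.shape[-1])
    affinity = softmax(query_hr @ key_lr.T * scale, axis=-1)  # (Nq, Nk)
    # Every query pixel attends to all low-resolution pixels,
    # i.e. a global receptive field rather than a local neighborhood.
    return affinity @ value_lr                                # (Nq, C)

def category_prototypes(feat, labels, num_classes):
    """Average the pixel features of each category.

    feat: (N, C), labels: (N,) integer category per pixel.
    Categories with no pixels get an all-zero prototype.
    """
    one_hot = np.eye(num_classes)[labels]        # (N, K)
    counts = one_hot.sum(axis=0)                 # (K,)
    sums = one_hot.T @ feat                      # (K, C)
    return sums / np.maximum(counts, 1.0)[:, None]

# Toy data: an 8x8 query map and a 4x4 feature map with 16 channels.
rng = np.random.default_rng(0)
query_hr = rng.normal(size=(8 * 8, 16))
feat_lr = rng.normal(size=(4 * 4, 16))
# Hypothetical per-pixel category assignment for the low-resolution map.
labels_lr = rng.integers(0, 3, size=4 * 4)

protos = category_prototypes(feat_lr, labels_lr, num_classes=3)
# Replace each source pixel feature with its category prototype
# before broadcasting it to the higher resolution.
upsampled = query_update(query_hr, feat_lr, protos[labels_lr])
print(upsampled.shape)  # (64, 16)
```

Because each output pixel is a convex combination of prototypes, pixels assigned to the same category contribute identical values, which is one way to read the abstract's claim about reducing intra-category feature variance. The dataset-level memory module is not sketched, since the abstract does not describe how it is updated.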