Computer science
Artificial intelligence
Computer vision
Pattern recognition (psychology)
Transformer
Simplicity (philosophy)
Machine learning
Engineering
Philosophy
Epistemology
Voltage
Electrical engineering
Authors
Qibin Hou, Cheng-Ze Lu, Ming-Ming Cheng, Jiashi Feng
Identifier
DOI:10.1109/tpami.2024.3401450
Abstract
Vision Transformers have recently been the most popular network architecture in visual recognition due to their strong ability to encode global information. However, their high computational cost when processing high-resolution images limits their application in downstream tasks. In this paper, we take a deep look at the internal structure of self-attention and present a simple Transformer-style convolutional neural network (ConvNet) for visual recognition. By comparing the design principles of recent ConvNets and Vision Transformers, we propose to simplify self-attention with a convolutional modulation operation. We show that this simple approach can better exploit the large kernels (≥ 7×7) nested in convolutional layers, and we observe a consistent performance improvement when gradually increasing the kernel size from 5×5 to 21×21. Using the proposed convolutional modulation, we build a family of hierarchical ConvNets, termed Conv2Former. Our network is simple and easy to follow. Experiments show that Conv2Former outperforms existing popular ConvNets and Vision Transformers, such as Swin Transformer and ConvNeXt, on ImageNet classification, COCO object detection, and ADE20K semantic segmentation.
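The core idea sketched in the abstract, replacing the attention matrix with convolutionally computed modulation weights that gate a projected value tensor, can be illustrated with a minimal 1-D toy example. This is a hedged sketch, not the paper's implementation: the function names, the 1-D setting, and the scalar "projections" `w_a`/`w_v` are illustrative simplifications (the actual Conv2Former operates on 2-D feature maps with learned linear layers and large depthwise kernels).

```python
def depthwise_conv1d(x, kernel):
    """Same-padded 1-D convolution of a single channel (list of floats)."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(x))]

def conv_modulation(x, kernel, w_a=1.0, w_v=1.0):
    """Toy convolutional modulation: output = A * V (elementwise),
    where A = DWConv(w_a * x) supplies the modulation weights and
    V = w_v * x is the projected value. Illustrative names only."""
    a = depthwise_conv1d([w_a * t for t in x], kernel)  # modulation weights
    v = [w_v * t for t in x]                            # value projection
    return [ai * vi for ai, vi in zip(a, v)]            # Hadamard product

x = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [0.25, 0.5, 0.25]  # small smoothing kernel; the paper favors kernels >= 7x7
y = conv_modulation(x, kernel)
# -> [1.0, 4.0, 9.0, 16.0, 17.5]
```

Unlike self-attention, whose pairwise similarity matrix costs quadratic time in the number of tokens, the modulation weights here come from a convolution, so the cost grows linearly with input size, which is the efficiency argument the abstract makes for high-resolution inputs.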