计算机科学
变压器
人工智能
计算机视觉
计算机工程
电气工程
工程类
电压
作者
Zhuofan Xia,Xuran Pan,Shiji Song,Li Erran Li,Gao Huang
标识
DOI:10.1109/cvpr52688.2022.00475
摘要
Transformers have recently shown superior performances on various vision tasks. The large, sometimes even global, receptive field endows Transformer models with higher representation power over their CNN counterparts. Nevertheless, simply enlarging receptive field also gives rise to several concerns. On the one hand, using dense attention e.g., in ViT, leads to excessive memory and computational cost, and features can be influenced by irrelevant parts which are beyond the region of interests. On the other hand, the sparse attention adopted in PVT or Swin Transformer is data agnostic and may limit the ability to model long range relations. To mitigate these issues, we propose a novel deformable selfattention module, where the positions of key and value pairs in selfattention are selected in a data-dependent way. This flexible scheme enables the self-attention module to focus on relevant re-gions and capture more informative features. On this basis, we present Deformable Attention Transformer, a general backbone model with deformable attention for both image classification and dense prediction tasks. Extensive experi-ments show that our models achieve consistently improved results on comprehensive benchmarks. Code is available at https://github.com/LeapLabTHU/DAT.
科研通智能强力驱动
Strongly Powered by AbleSci AI