Keywords
Artificial intelligence
Computer science
Computer vision
Transformer
Pattern recognition
Feature extraction
Authors
Shuyang Sun,Xiaoyu Yue,Hengshuang Zhao,Junwei Han,Song Bai
Identifier
DOI:10.1109/tpami.2022.3231725
Abstract
The computational complexity of transformers limits their wide deployment in frameworks for visual recognition. Recent work [9] significantly accelerates network processing by reducing the resolution at the beginning of the network; however, the resulting model is still hard to generalize directly to other downstream tasks, e.g., object detection and segmentation, in the way CNNs can be. In this paper, we present a transformer-based architecture that retains both local and global interactions within the network and is transferable to other downstream tasks. The proposed architecture reforms the original full spatial self-attention into pixel-wise local attention and patch-wise global attention. This factorization saves computational cost while retaining information at different granularities, which helps generate the multi-scale features required by different tasks. Exploiting the factorized attention, we construct a Separable Transformer (SeT) for visual modeling. Experimental results show that SeT outperforms previous state-of-the-art transformer-based approaches and its CNN counterparts on three major tasks: image classification, object detection, and instance segmentation.
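The abstract's factorization of full spatial self-attention into pixel-wise local attention (within each patch) plus patch-wise global attention (across patch summaries) can be illustrated with a minimal sketch. This is not the authors' implementation; the function names, the 1-D patch layout, and the use of mean pooling to form patch tokens are assumptions made for illustration only. Full attention over n tokens costs O(n^2), while this split costs roughly O(n·p) for the local part plus O((n/p)^2) for the global part, with patch size p.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # standard scaled dot-product attention over the last two axes
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def factorized_attention(x, patch=4):
    # x: (n, d) flattened tokens; n must be divisible by `patch`.
    # Hypothetical sketch: self-attention only (q = k = v = x),
    # no learned projections or multiple heads.
    n, d = x.shape
    # pixel-wise local attention: tokens attend only within their patch
    local = x.reshape(n // patch, patch, d)
    local = attention(local, local, local).reshape(n, d)
    # patch-wise global attention: one pooled token per patch
    pooled = x.reshape(n // patch, patch, d).mean(axis=1)[None]  # (1, n/patch, d)
    glob = attention(pooled, pooled, pooled)[0]                  # (n/patch, d)
    # broadcast the global context back to every pixel in the patch
    return local + np.repeat(glob, patch, axis=0)
```

How the local and global branches are actually combined in SeT (sum, concatenation, or interleaved blocks) is an architectural detail not specified in the abstract; the sum above is just one plausible choice.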