Keywords
Computer science
Transformer
Computation
Convolutional neural network
Artificial intelligence
Stacking
Scaling
Deep learning
Feature learning
Failure
Coding (set theory)
Pattern recognition (psychology)
Machine learning
Algorithm
Parallel computing
Engineering
Voltage
Mathematics
Nuclear magnetic resonance
Electrical engineering
Physics
Geometry
Set (abstract data type)
Programming language
Authors
Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, Jiashi Feng
Source
Journal: Cornell University - arXiv
Date: 2021-01-01
Citations: 330
Identifiers
DOI: 10.48550/arxiv.2103.11886
Abstract
Vision transformers (ViTs) have recently been applied successfully to image classification tasks. In this paper, we show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are scaled deeper. More specifically, we empirically observe that this scaling difficulty is caused by an attention collapse issue: as the transformer goes deeper, the attention maps gradually become similar and even nearly identical after certain layers. In other words, the feature maps tend to be identical in the top layers of deep ViT models. This demonstrates that in the deeper layers of ViTs, the self-attention mechanism fails to learn effective concepts for representation learning and prevents the model from obtaining the expected performance gains. Based on this observation, we propose a simple yet effective method, named Re-attention, which re-generates the attention maps to increase their diversity across layers with negligible computation and memory cost. The proposed method makes it feasible to train deeper ViT models with consistent performance improvements via minor modifications to existing ViT models. Notably, when training a deep ViT model with 32 transformer blocks, the Top-1 classification accuracy on ImageNet can be improved by 1.6%. Code is publicly available at https://github.com/zhoudaquan/dvit_repo.
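The abstract describes Re-attention only at a high level: the per-head attention maps are re-generated to restore their diversity in deep layers, at negligible extra cost. Below is a minimal PyTorch sketch of that idea under the assumption that the re-generation is a learnable 1×1 mixing across the head dimension of the softmaxed attention maps, followed by a normalization, before the maps are applied to the values. The class name ReAttention, the choice of Conv2d/BatchNorm2d for the mixing and normalization, and the tensor shapes are illustrative assumptions, not the authors' reference code from the linked repository.

```python
# Sketch of the Re-attention idea: mix the per-head attention maps with a small
# learnable transformation so that deep layers do not collapse to identical maps.
import torch
import torch.nn as nn


class ReAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        # Assumption: a learnable head-mixing matrix implemented as a 1x1 conv
        # over the head (channel) axis of the attention maps, plus normalization.
        self.head_mix = nn.Conv2d(num_heads, num_heads, kernel_size=1)
        self.head_norm = nn.BatchNorm2d(num_heads)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)            # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale    # (B, heads, N, N)
        attn = attn.softmax(dim=-1)
        # Re-attention step: blend attention maps across heads, then normalize.
        attn = self.head_norm(self.head_mix(attn))
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


if __name__ == "__main__":
    # Example token layout loosely modeled on a small ViT: 197 tokens, dim 384.
    x = torch.randn(2, 197, 384)
    print(ReAttention(dim=384, num_heads=6)(x).shape)   # torch.Size([2, 197, 384])
```

Because the mixing acts only on an (heads × N × N) tensor with a handful of extra parameters, it adds little computation or memory, which is consistent with the "negligible cost" claim in the abstract.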