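Transformer
Computer science
Pretext
Inference
Artificial intelligence
Computer vision
Machine learning
Engineering
Voltage
Electrical engineering
Law
Politics
Political science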
Authors
Chaitanya K. Ryali,Yuan-Ting Hu,Daniel Bolya,Wei Chen,Haoqi Fan,Po-Yao Huang,Vaibhav Aggarwal,Arkabandhu Chowdhury,Omid Poursaeed,Judy Hoffman,Jitendra Malik,Yanghao Li,Christoph Feichtenhofer
Source
Journal: Cornell University - arXiv
Date: 2023-06-01
Citations: 58
Identifier
DOI:10.48550/arxiv.2306.00989
Abstract
Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance. While these components lead to effective accuracies and attractive FLOP counts, the added complexity actually makes these transformers slower than their vanilla ViT counterparts. In this paper, we argue that this additional bulk is unnecessary. By pretraining with a strong visual pretext task (MAE), we can strip out all the bells-and-whistles from a state-of-the-art multi-stage vision transformer without losing accuracy. In the process, we create Hiera, an extremely simple hierarchical vision transformer that is more accurate than previous models while being significantly faster both at inference and during training. We evaluate Hiera on a variety of tasks for image and video recognition. Our code and models are available at https://github.com/facebookresearch/hiera.