Intuition
Computer science
Residual neural network
Visualization
Artificial intelligence
Machine learning
Perspective (graphics)
Train
Pattern recognition (psychology)
Artificial neural network
Representation (politics)
Data mining
Cartography
Geography
Psychology
Politics
Political science
Law
Cognitive science
Authors
Hehua Zhu, Boyuan Chen, Carter Yang
Source
Journal: Cornell University - arXiv
Date: 2023-02-07
Identifier
DOI: 10.48550/arxiv.2302.03751
Abstract
Vision transformer (ViT) is an attention-based neural network architecture that has proven effective for computer vision tasks. However, compared to a ResNet-18 with a similar number of parameters, ViT achieves significantly lower evaluation accuracy when trained on small datasets. To facilitate studies in related fields, we provide a visual intuition for why this is the case. We first compare the performance of the two models and confirm that ViT is less accurate than ResNet-18 when trained on small datasets. We then interpret the results through attention map visualizations for ViT and feature map visualizations for ResNet-18. The difference is further analyzed from a representation similarity perspective. We conclude that the representation learned by ViT on small datasets differs substantially from that learned on large datasets, which may explain the large performance drop on small datasets.
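The abstract's "representation similarity perspective" is commonly implemented with linear centered kernel alignment (CKA); the abstract does not name the paper's exact metric, so the following is an illustrative sketch of linear CKA, not the authors' confirmed method. `X` and `Y` are hypothetical activation matrices (rows = the same n inputs, columns = features of each model's layer).

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices X (n x d1)
    and Y (n x d2), whose rows correspond to the same n inputs."""
    X = X - X.mean(axis=0)  # center each feature dimension
    Y = Y - Y.mean(axis=0)
    # CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    den = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return num / den

# Sanity check on random stand-in activations: a representation
# compared with itself yields a similarity of exactly 1.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 64))
print(round(linear_cka(A, A), 6))  # → 1.0
```

A score near 1 means two layers encode near-identical structure over the same inputs; comparing layer-by-layer CKA between a ViT trained on a small dataset and one trained on a large dataset is one way to expose the representational gap the abstract describes. Linear CKA is also invariant to orthogonal transforms of the feature space, which makes it suitable for comparing layers of different widths.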