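Transformer
Computer science
Security token
Action recognition
Artificial intelligence
Locality
Pattern recognition (psychology)
Machine learning
Computer security
Linguistics
Quantum mechanics
Physics
Philosophy
Voltage
Class (philosophy)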
Authors
Weirong Sun, Yujun Ma, Ruili Wang
Source
Journal: Neurocomputing
[Elsevier BV]
Date: 2024-01-11
Volume: 574, article 127256
Citations: 11
Identifier
DOI:10.1016/j.neucom.2024.127256
Abstract
Action recognition aims to understand human behavior and predict a label for each action. Recently, the Vision Transformer (ViT) has achieved remarkable performance on action recognition by modeling long sequences of tokens over the spatial and temporal indices of a video. The fully connected self-attention layer is the fundamental component of the vanilla Transformer. However, the redundant architecture of the vision Transformer ignores the locality of video frame patches, so non-informative tokens take part in attention, which can increase computational complexity. To solve this problem, we propose a k-NN attention-based Video Vision Transformer (k-ViViT) network for action recognition. We apply k-NN attention to the Video Vision Transformer (ViViT) in place of the original self-attention, which can optimize the training process and ignore irrelevant or noisy tokens in the input sequence. We conduct experiments on the UCF101 and HMDB51 datasets to verify the effectiveness of our model. The experimental results show that the proposed k-ViViT achieves superior accuracy compared to several state-of-the-art models on these action recognition datasets.
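The abstract does not spell out how k-NN attention is computed, so the following is only a minimal sketch of the general idea: each query keeps its k highest-scoring keys and applies softmax over those alone. The function name `knn_attention`, the single-head setup, and the scaled dot-product scoring are assumptions made for illustration, not the authors' implementation.

```python
import math


def knn_attention(queries, keys, values, k):
    """Single-head attention where each query attends only to its k best keys.

    queries, keys, values are lists of equal-length float vectors (tokens).
    This is an illustrative top-k variant of scaled dot-product attention.
    """
    if not 1 <= k <= len(keys):
        raise ValueError("k must be between 1 and the number of keys")
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score between this query and every key.
        scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(d) for key in keys]
        # Indices of the k largest scores; the remaining tokens are dropped.
        top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
        # Softmax over the selected scores only (same as masking the rest to -inf).
        peak = max(scores[j] for j in top)
        weights = {j: math.exp(scores[j] - peak) for j in top}
        total = sum(weights.values())
        # Weighted sum of the selected value vectors.
        outputs.append(
            [sum(weights[j] / total * values[j][c] for j in top)
             for c in range(len(values[0]))]
        )
    return outputs
```

With k equal to the number of tokens this reduces to ordinary softmax attention, and with k = 1 each query simply copies the value of its best-matching key. Smaller k discards more low-scoring tokens, which is the property the abstract relies on for ignoring irrelevant or noisy patches.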