Security token
Computer science
Transformer
Artificial intelligence
Object (grammar)
TUTOR
Speedup
Object detection
Computer vision
Human-computer interaction
Pattern recognition (psychology)
Computer network
Engineering
Voltage
Programming language
Operating system
Electrical engineering
Authors
Danyang Tu, Wei Sun, Xiongkuo Min, Guangtao Zhai, Wei Shen
Source
Journal: Cornell University - arXiv
Date: 2022-06-04
Citations: 8
Identifier
DOI: 10.48550/arXiv.2206.01908
Abstract
We present a novel vision Transformer, named TUTOR, which is able to learn tubelet tokens, serving as highly abstracted spatiotemporal representations, for video-based human-object interaction (V-HOI) detection. The tubelet tokens structurize videos by agglomerating and linking semantically related patch tokens along the spatial and temporal domains, which enjoys two benefits: 1) Compactness: each tubelet token is learned by a selective attention mechanism to reduce redundant spatial dependencies from others; 2) Expressiveness: each tubelet token is enabled to align with a semantic instance, i.e., an object or a human, across frames, thanks to agglomeration and linking. The effectiveness and efficiency of TUTOR are verified by extensive experiments. Results show that our method outperforms existing works by large margins, with a relative mAP gain of $16.14\%$ on VidHOI and a 2-point gain on CAD-120, as well as a $4\times$ speedup.
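The abstract only sketches the mechanism, so the following is a minimal, hypothetical PyTorch sketch of what agglomerating patch tokens into tubelet tokens with a selective (top-k) attention mask could look like. The module name `TubeletAgglomeration`, the learned per-tubelet queries, the top-k masking rule, and the temporal averaging used for "linking" are all assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code): patch tokens are softly
# assigned to a smaller set of tubelet tokens via cross-attention; a top-k
# mask ("selective attention" stand-in) keeps only the strongest
# patch-to-tubelet links, reducing redundant spatial dependencies.
import torch
from torch import nn


class TubeletAgglomeration(nn.Module):
    def __init__(self, dim: int, num_tubelets: int, k: int):
        super().__init__()
        # Learned queries, one per tubelet token (assumption: queries are
        # shared across frames so each tubelet can link over time).
        self.queries = nn.Parameter(torch.randn(num_tubelets, dim) * 0.02)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.k = k  # number of patch tokens each tubelet may attend to

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, frames, num_patches, dim)
        b, t, n, d = patches.shape
        keys = self.to_k(patches)            # (b, t, n, d)
        vals = self.to_v(patches)            # (b, t, n, d)
        # Similarity between every tubelet query and every patch, per frame.
        attn = torch.einsum("md,btnd->btmn", self.queries, keys) / d ** 0.5
        # Selective attention: keep the top-k patches per tubelet per frame,
        # mask all weaker links before the softmax.
        kth = attn.topk(self.k, dim=-1).values[..., -1:]   # k-th largest score
        attn = attn.masked_fill(attn < kth, float("-inf"))
        attn = attn.softmax(dim=-1)
        # Agglomerate: each tubelet token is a weighted sum of its patches;
        # "linking" across time is approximated here by averaging over frames.
        tubelets = torch.einsum("btmn,btnd->btmd", attn, vals)
        return tubelets.mean(dim=1)          # (b, num_tubelets, dim)


if __name__ == "__main__":
    x = torch.randn(2, 8, 196, 256)          # 2 clips, 8 frames, 14x14 patches
    model = TubeletAgglomeration(dim=256, num_tubelets=16, k=32)
    print(model(x).shape)                    # torch.Size([2, 16, 256])
```

The compactness claim maps to the top-k mask (each tubelet depends on few patches); the expressiveness claim maps to the same query reading from every frame, so a tubelet can track one instance over time. How the paper actually selects and links tokens is specified in the full text, not here.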