Computer science
Optical flow
Transformer
Modality
Artificial intelligence
Action recognition
RGB color model
Computer vision
Pattern recognition (psychology)
Speech recognition
Voltage
Engineering
Image (mathematics)
Electrical engineering
Chemistry
Polymer chemistry
Class (philosophy)
Authors
Jiawei Chen, Chiu Man Ho
Source
Journal: Cornell University - arXiv
Date: 2021-08-20
Identifier
DOI: 10.48550/arxiv.2108.09322
Abstract
This paper presents a pure transformer-based approach, dubbed the Multi-Modal Video Transformer (MM-ViT), for video action recognition. Different from other schemes which solely utilize the decoded RGB frames, MM-ViT operates exclusively in the compressed video domain and exploits all readily available modalities, i.e., I-frames, motion vectors, residuals and audio waveform. In order to handle the large number of spatiotemporal tokens extracted from multiple modalities, we develop several scalable model variants which factorize self-attention across the space, time and modality dimensions. In addition, to further explore the rich inter-modal interactions and their effects, we develop and compare three distinct cross-modal attention mechanisms that can be seamlessly integrated into the transformer building block. Extensive experiments on three public action recognition benchmarks (UCF-101, Something-Something-v2, Kinetics-600) demonstrate that MM-ViT outperforms the state-of-the-art video transformers in both efficiency and accuracy, and performs better or equally well to the state-of-the-art CNN counterparts with computationally-heavy optical flow.
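Below is a minimal PyTorch sketch of the kind of self-attention factorization the abstract describes, where attention is applied separately along the space, time, and modality axes so each pass scales with one axis length rather than their product. This is not the authors' released code: the (batch, time, space, modality, dim) token layout, the class name FactorizedAttentionBlock, and the toy sizes are illustrative assumptions.

import torch
import torch.nn as nn

class FactorizedAttentionBlock(nn.Module):
    """Self-attention applied separately along the space, time and
    modality axes of a (batch, time, space, modality, dim) token grid.
    Illustrative sketch; not the MM-ViT reference implementation."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.modal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(3)])

    def _attend(self, attn, norm, x, axis):
        # Move the chosen axis next to the feature dim, fold all other
        # axes into the batch, attend along that axis, then restore layout.
        x = x.movedim(axis, -2)
        shape = x.shape
        flat = x.reshape(-1, shape[-2], shape[-1])
        h = norm(flat)                                # pre-norm
        out, _ = attn(h, h, h, need_weights=False)
        flat = flat + out                             # residual connection
        return flat.reshape(shape).movedim(-2, axis)

    def forward(self, x):
        # x: (batch, time, space, modality, dim)
        x = self._attend(self.space_attn, self.norms[0], x, axis=2)
        x = self._attend(self.time_attn,  self.norms[1], x, axis=1)
        x = self._attend(self.modal_attn, self.norms[2], x, axis=3)
        return x

# Toy usage: 2 clips, 4 frames, 49 patch tokens, 3 compressed-domain
# modalities (I-frame, motion vector, residual), 128-dim embeddings.
tokens = torch.randn(2, 4, 49, 3, 128)
block = FactorizedAttentionBlock(dim=128)
print(block(tokens).shape)  # torch.Size([2, 4, 49, 3, 128])

Each of the three attention passes only mixes tokens along its own axis, which is what keeps the cost manageable for the large multi-modal token grids the abstract mentions; the paper's cross-modal attention variants would replace or augment the modality-axis pass here.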