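Computer science
Motion (physics)
Convolution (computer science)
Chaotic
Artificial intelligence
Key (lock)
Frame (networking)
Action (physics)
Representation (politics)
Aggregate (composite)
Modulation (music)
Pattern recognition (psychology)
Computer vision
Artificial neural network
Telecommunications
Physics
Materials science
Computer security
Quantum mechanics
Politics
Political science
Acoustics
Law
Composite material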
Authors
Weiji Zhao,Kefeng Huang,Chongyang Zhang
Identifier
DOI:10.1109/icassp49357.2023.10095853
Abstract
The goal of spatial-temporal action detection is to generate spatially and temporally aligned action tubes. Most existing 2D CNN-based solutions aggregate context from temporally adjacent frames directly, without alignment. The resulting misaligned spatial-temporal features can lead to chaotic representations and misaligned action tubes. Moreover, most existing methods fail to exploit motion dependencies efficiently. In this paper, we propose Modulation-based Center Alignment (MCA) and Sparse Valuable Motion Mining (SVMM) for more accurate action detection. First, key-frame-based modulation, built on deformable convolution, aligns the action center across temporal frames; then, motion-region-guided sparse self-attention mines valuable motion. Experimental results on two widely used benchmarks, J-HMDB and UCF101-24, show that our framework significantly outperforms current 2D CNN-based methods.
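The abstract does not give implementation details, so the following is only a minimal sketch of the general idea behind motion-region-guided sparse self-attention, not the authors' SVMM module. It assumes (hypothetically) that motion is approximated by the L1 difference between key-frame and neighbor-frame features, that the top-k locations by that score form the "motion region", and that queries, keys, and values are the raw features with no learned projections. All function names are illustrative.

```python
import math

def motion_scores(key_feats, neighbor_feats):
    """Assumed motion proxy: per-location L1 difference between two frames."""
    return [sum(abs(a - b) for a, b in zip(k, n))
            for k, n in zip(key_feats, neighbor_feats)]

def top_k_indices(scores, k):
    """Indices of the k largest scores, returned in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_self_attention(feats, selected):
    """Scaled dot-product self-attention restricted to `selected` locations.

    Locations outside `selected` are copied through unchanged.
    Q, K, and V are the raw features (no learned projections: an assumption).
    """
    dim = len(feats[0])
    scale = 1.0 / math.sqrt(dim)
    out = [list(f) for f in feats]
    for i in selected:
        logits = [scale * sum(q * k for q, k in zip(feats[i], feats[j]))
                  for j in selected]
        weights = softmax(logits)
        out[i] = [sum(w * feats[j][d] for w, j in zip(weights, selected))
                  for d in range(dim)]
    return out

# Toy example: 4 spatial locations, 2 feature channels each.
key = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.5, 0.5]]
neighbor = [[1.0, 0.0], [0.0, 3.0], [0.0, 0.0], [0.5, 0.5]]

scores = motion_scores(key, neighbor)        # large where the frames differ
selected = top_k_indices(scores, k=2)        # assumed "motion region"
attended = sparse_self_attention(key, selected)
```

In this toy input only locations 1 and 2 change between frames, so attention is computed among those two locations and the static locations pass through untouched. Attention cost then scales with the square of k rather than the square of the full number of locations, which is the usual motivation for sparsifying attention this way.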