Keywords: Computer science, Optical flow, Artificial intelligence, Convolutional neural network, Discriminative model, RGB color model, Motion, Pattern recognition, Computer vision, Key frame, Frame (video), Block, Image, Telecommunications, Computer security, Mathematics, Geometry, Identification
DOI:10.1109/tmm.2022.3148588
Abstract
Recent years have witnessed the popularity of two-stream architectures and attention mechanisms for action recognition in videos. However, training two separate convolutional neural networks (ConvNets) is time-consuming, especially when coupled with a complex attention mechanism. In this paper, we present a novel architecture, termed the Appearance-Motion Fusion Network (AMFNet), to learn efficient and robust action representations from RGB and optical flow data in an end-to-end manner. AMFNet is constructed by connecting a convolutional neural network with an appearance-motion fusion block (AMFB), whose goal is to incorporate the appearance and motion streams into a unified framework driven by a cross-modality attention (CMA) mechanism. More specifically, the CMA mechanism relies only on optical flow data and consists of a Key-Frame Adaptive Selection Module (KFASM) and an Optical-Flow-Driven Spatial Attention Module (OFDSAM). The former adaptively identifies the discriminative key frames in a sequence, while the latter guides the network to focus on the action-relevant regions of each frame. We explore two schemes for fusing the appearance and motion streams in AMFB, at hierarchical and comprehensive levels. The proposed AMFNet is extensively evaluated on five action recognition datasets: HMDB-51, UCF-101, JHMDB, Penn, and Kinetics-400. Compared to state-of-the-art methods operating on RGB and optical flow, the experimental results validate that our AMFNet achieves comparable performance with a pure 2D single-ConvNet design.
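To make the CMA idea concrete, the sketch below illustrates the two roles the abstract assigns to it: a KFASM-style module that scores frames from flow features and keeps the most discriminative ones, and an OFDSAM-style module that derives a spatial mask from flow to reweight the RGB appearance features. This is a minimal illustration, not the paper's implementation: the module names, layer sizes, pooling choices, and the top-k selection rule are all assumptions.

```python
# Minimal PyTorch sketch of a flow-driven cross-modality attention (CMA).
# All design details (scoring head, 1x1 conv, top-k) are assumptions made
# for illustration; the paper's actual KFASM/OFDSAM may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeyFrameSelector(nn.Module):
    """KFASM-style scoring: rank frames by a score derived from flow features."""
    def __init__(self, channels, k=8):
        super().__init__()
        self.k = k
        self.score = nn.Linear(channels, 1)  # hypothetical per-frame scoring head

    def forward(self, flow_feat):            # flow_feat: (B, T, C, H, W)
        pooled = flow_feat.mean(dim=(3, 4))  # (B, T, C): global average pool per frame
        logits = self.score(pooled).squeeze(-1)      # (B, T) frame importance scores
        topk = logits.topk(self.k, dim=1).indices    # indices of k key frames
        return topk, F.softmax(logits, dim=1)        # hard selection + soft weights

class FlowSpatialAttention(nn.Module):
    """OFDSAM-style attention: a flow-derived mask reweights RGB features."""
    def __init__(self, channels):
        super().__init__()
        self.attn = nn.Conv2d(channels, 1, kernel_size=1)  # hypothetical 1x1 conv

    def forward(self, rgb_feat, flow_feat):  # both (B, C, H, W), for one frame
        mask = torch.sigmoid(self.attn(flow_feat))  # (B, 1, H, W) attention in [0, 1]
        return rgb_feat * mask                      # emphasize action-relevant regions
```

Under these assumptions, the attention itself is computed entirely from the optical flow stream, consistent with the abstract's claim that the CMA relies only on flow data, while the appearance stream is only modulated by it.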