Pose-Appearance Relational Modeling for Video Action Recognition

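Computer science, Artificial intelligence, Robustness (evolution), Pose, Computer vision, Pattern recognition (psychology), Optical flow, Action recognition, Context (archaeology), Articulated human pose estimation, Image (mathematics), 3D pose estimation, Paleontology, Biochemistry, Chemistry, Biology, Gene, Class (philosophy)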

Authors

Mengmeng Cui, Wei Wang, Kunbo Zhang, Zhenan Sun, Liang Wang

Source

Journal: IEEE Transactions on Image Processing [Institute of Electrical and Electronics Engineers]
Date: 2023-01-01; Volume: 32, pp. 295-308; Cited by: 5

Links

nih.gov, doi.org

Identifier

DOI: 10.1109/tip.2022.3228156

Abstract

Recent studies of video action recognition can be classified into two categories: appearance-based methods and pose-based methods. The appearance-based methods generally cannot model the temporal dynamics of large motion well by means of optical flow estimation, while the pose-based methods ignore visual context information such as typical scenes and objects, which are also important cues for action understanding. In this paper, we tackle these problems by proposing a Pose-Appearance Relational Network (PARNet), which models the correlation between human pose and image appearance, and combines the benefits of these two modalities to improve robustness on unconstrained real-world videos. There are three network streams in our model, namely the pose stream, the appearance stream and the relation stream. For the pose stream, a Temporal Multi-Pose RNN module is constructed to obtain dynamic representations through temporal modeling of 2D poses. For the appearance stream, a Spatial Appearance CNN module is employed to extract the global appearance representation of the video sequence. For the relation stream, a Pose-Aware RNN module is built to connect the pose and appearance streams by modeling action-sensitive visual context information. By jointly optimizing the three modules, PARNet achieves superior performance compared with state-of-the-art methods on both the pose-complete datasets (KTH, Penn-Action, UCF11) and the challenging pose-incomplete datasets (UCF101, HMDB51, JHMDB), demonstrating its robustness to complex environments and noisy skeletons. Its effectiveness on the NTU-RGBD dataset is also validated, even in comparison with 3D skeleton-based methods. Furthermore, an appearance-enhanced PARNet equipped with an RGB-based I3D stream is proposed, which outperforms the Kinetics pre-trained competitors on UCF101 and HMDB51. These improved experimental results verify the potential of our framework for integrating various modules.
