Multi-Modal Feature Synergy in Dual-Stream Networks with Cross-Attention for Action Recognition

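Computer science; Artificial intelligence; Feature (linguistics); Action (physics); Pattern recognition (psychology); Action recognition; Artificial neural network; Feature extraction; Feature selection; Key (lock); Field (mathematics)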

Authors

Junchi Lu, Zhitong Liu, Bing Xu, Yu Fu, H. J. Yang

Identifier

DOI: 10.1109/iccc68654.2025.11437800

Abstract

RGB-based human action recognition systems are vulnerable in complex environments and dynamic scenarios, and integrating the skeleton modality can mitigate this weakness. Multimodal action recognition methods that combine RGB and skeleton data have therefore been attracting growing attention. However, the recognition performance of existing methods remains limited because their sampling methods, feature modeling strategies, and cross-modal fusion strategies are insufficiently optimized. To address these limitations, we propose MMActionFormer, a dual-stream network with multi-modal feature synergy and cross-attention for action recognition, designed to exploit the complementary semantic information in the RGB and skeleton modalities. Specifically, we first design modality-specific sampling strategies based on the inherent advantages of RGB and skeleton data. Spatial cues derived from the skeleton then guide the adaptive cropping of key motion regions within RGB frames, which mitigates the confounding effect of irrelevant background clutter. Furthermore, a lightweight feature encoding module performs discriminative representation learning, retaining action-related key semantic features while reducing dimensionality and improving computational efficiency. Finally, a novel cross-attention mechanism models inter-modal dependencies and enables bidirectional feature refinement between the RGB and skeleton representations. Experiments on four action datasets (UCF101, HMDB-51, Kinetics400, and Kinetics600) show that MMActionFormer effectively leverages the complementary properties of the RGB and skeleton modalities and significantly improves recognition accuracy. Importantly, the framework achieves performance competitive with existing representative methods while significantly accelerating inference.
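
The abstract names two mechanisms without giving their details: skeleton-guided cropping of RGB frames and bidirectional cross-attention between the two streams. The following is a minimal sketch of the general ideas only, not the authors' MMActionFormer implementation. All function names, the `margin` parameter, the padded bounding-box rule, and the use of plain scaled dot-product attention without learned projections are assumptions made for illustration.

```python
import math

def skeleton_crop_box(joints, frame_w, frame_h, margin=0.1):
    """Return (x0, y0, x1, y1) around 2D joints, padded and clipped to the frame.

    `margin` is a hypothetical padding factor (fraction of the box size);
    the paper's actual cropping rule is not given in the abstract.
    """
    xs = [x for x, _ in joints]
    ys = [y for _, y in joints]
    pad_x = (max(xs) - min(xs)) * margin
    pad_y = (max(ys) - min(ys)) * margin
    x0 = max(0, int(math.floor(min(xs) - pad_x)))
    y0 = max(0, int(math.floor(min(ys) - pad_y)))
    x1 = min(frame_w, int(math.ceil(max(xs) + pad_x)))
    y1 = min(frame_h, int(math.ceil(max(ys) + pad_y)))
    return x0, y0, x1, y1

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, context):
    """Scaled dot-product attention with no learned projections (assumption).

    Each query token is replaced by a softmax-weighted average of the
    context tokens; both inputs are lists of equal-length float vectors.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in context]
        weights = softmax(scores)
        attended.append([sum(w * k[i] for w, k in zip(weights, context))
                         for i in range(dim)])
    return attended

def bidirectional_refine(rgb_tokens, skel_tokens):
    """Refine each stream with a residual update attended from the other."""
    rgb_ctx = cross_attend(rgb_tokens, skel_tokens)   # RGB queries skeleton
    skel_ctx = cross_attend(skel_tokens, rgb_tokens)  # skeleton queries RGB
    rgb_out = [[a + b for a, b in zip(t, c)]
               for t, c in zip(rgb_tokens, rgb_ctx)]
    skel_out = [[a + b for a, b in zip(t, c)]
                for t, c in zip(skel_tokens, skel_ctx)]
    return rgb_out, skel_out
```

In this sketch, a frame would be cropped to `skeleton_crop_box(...)` before RGB encoding, and `bidirectional_refine` would be applied to the encoded token sequences of the two streams. A trained model would normally add learned query, key, and value projections and multiple heads, which are omitted here.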
