MMCPose: Multimodal Condition-Driven 3D Human Pose Estimation Via Diffusion Models

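Keywords: Computer science; Artificial intelligence; Pose; Relation (database); Monocular; Computer vision; Block (permutation group theory); Noise (video); Perception; 3D pose estimation; Machine learning; Deep learning; Pattern recognition (psychology); Power (physics); Articulated human pose estimation; Feature (linguistics); Noise reduction; Feature learning; Reduction (mathematics); Joint (building); Mechanism (biology); Discriminative; Supervised learning; Visualization; Scheme (mathematics); Data modeling; Solid modeling; Frame (networking); Focus (telecommunications); Contrast (vision)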

Authors

Xixia Xu, Jiamao Li

Source

Journal: IEEE Transactions on Multimedia [Institute of Electrical and Electronics Engineers]
Date: 2026-01-01 Volume/issue: not listed; pages: 1-11

Identifier

DOI: 10.1109/tmm.2026.3654424

Abstract

Diffusion-based methods for monocular 3D human pose estimation (3D HPE) have achieved state-of-the-art performance by directly regressing 3D joint coordinates from 2D observations. Although some methods incorporate a human body prior to improve denoising quality, the absence of structural relations and pose-aware guidance makes these models prone to generating implausible poses. The problem is most noticeable under complex conditions such as occlusion and crowded scenes. To alleviate this, we present MMCPose, a novel Multi-modal Condition-driven 3D HPE framework via diffusion models that capitalizes on multi-modal conditioning input. Specifically, we propose a Multi-modal Condition Learning (MCL) strategy that incorporates multi-modal conditions such as joint-wise relations, part-aware prompts, and pose-aware masks to improve generation quality. The MCL block consists of three components: (i) Joint-wise Relation Condition Learning (JRCL), which models flexible joint-wise relations via a GCN to mitigate disturbances arising from confused joints; (ii) Part-aware Prompt Condition Learning (PPCL), which constructs multi-granular prompts from accessible texts and feasible knowledge of body parts, together with learnable prompts, to model implicit textual guidance; and (iii) Pose-aware Mask Condition Learning (PMCL), which designs a pose-specific mask to increase the model's emphasis on the pose region, improving precision in capturing intricate pose details. Furthermore, we explore a multi-modal condition-pose interaction learning (MCPI) mechanism that establishes interaction between the learned multi-modal conditions and the poses to maximize the effect of the conditions. This method fully unleashes the perceptual capability of multi-modal conditions in diffusion-based 3D HPE. Extensive evaluations on two popular benchmarks (Human3.6M and MPI-INF-3DHP) show that MMCPose achieves new state-of-the-art performance.
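
The abstract says that JRCL models joint-wise relations "via GCN" but gives no architectural details. As a rough illustration of what a graph convolution over skeleton joints looks like, the sketch below applies one generic, symmetrically normalized graph-convolution layer to 2D joint inputs. It is not the authors' implementation: the five-joint skeleton, the `EDGES` list, the feature width of 8, and the function names are all hypothetical choices made for this example.

```python
import math
import random

# Hypothetical 5-joint toy skeleton (not the Human3.6M layout):
# joint 0 is a root connected to joints 1, 3, 4; joint 1 connects to joint 2.
EDGES = [(0, 1), (1, 2), (0, 3), (0, 4)]


def normalized_adjacency(num_joints, edges):
    """Return D^{-1/2} (A + I) D^{-1/2} for an undirected skeleton graph."""
    a = [[1.0 if i == j else 0.0 for j in range(num_joints)]
         for i in range(num_joints)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]  # node degree including the self-loop
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_joints)]
            for i in range(num_joints)]


def matmul(p, q):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]


def gcn_layer(x, a_hat, w):
    """One graph convolution: aggregate neighbours, project, apply ReLU."""
    h = matmul(matmul(a_hat, x), w)
    return [[max(v, 0.0) for v in row] for row in h]


random.seed(0)
joints_2d = [[random.gauss(0, 1) for _ in range(2)] for _ in range(5)]  # toy 2D inputs
weights = [[random.gauss(0, 1) for _ in range(8)] for _ in range(2)]    # untrained projection
a_hat = normalized_adjacency(5, EDGES)
condition = gcn_layer(joints_2d, a_hat, weights)  # 5 joints x 8 channels
```

In a diffusion-based estimator, per-joint features of this kind could serve as one conditioning signal for the denoiser; how MMCPose actually combines them with the prompt and mask conditions is not specified in the abstract.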
