Keywords
Computer science, Artificial intelligence, Exploitation, Ambiguity, Transformer, Pose, RGB color model, Computer vision, Encoder, Granularity, Pattern recognition (psychology), Speech recognition, Engineering, Operating system, Programming language, Voltage, Electrical engineering, Computer security
Authors
Yilin Wen, Hao Pan, Lei Yang, Jia Pan, Taku Komura, Wenping Wang
Source
Journal: Cornell University - arXiv
Date: 2022-09-20
Citations: 5
Identifier
DOI: 10.48550/arxiv.2209.09484
Abstract
Understanding dynamic hand motions and actions from egocentric RGB videos is a fundamental yet challenging task due to self-occlusion and ambiguity. To address these challenges, we develop a transformer-based framework that exploits temporal information for robust estimation. Noticing the different temporal granularities of, and the semantic correlation between, hand pose estimation and action recognition, we build a network hierarchy with two cascaded transformer encoders: the first exploits short-term temporal cues for hand pose estimation, and the second aggregates per-frame pose and object information over a longer time span to recognize the action. Our approach achieves competitive results on two first-person hand action benchmarks, FPHA and H2O. Extensive ablation studies verify our design choices.
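To make the cascaded design described above concrete, the following is a minimal PyTorch sketch, not the authors' released code: the layer counts, feature width, attention heads, the pose-feedback embedding, and the CLS-style action token are all assumptions for illustration. The first encoder applies temporal attention to per-frame features and regresses 3D hand joints; the second aggregates pose-augmented tokens over the clip to classify the action.

# Minimal sketch, assuming per-frame backbone features as input; hyperparameters
# (feat_dim, heads, layers, class counts) are illustrative, not from the paper.
import torch
import torch.nn as nn

class CascadedPoseActionNet(nn.Module):
    def __init__(self, feat_dim=512, num_joints=21, num_actions=45):
        super().__init__()
        # Stage 1: short-term encoder refines per-frame features for 3D hand pose.
        self.pose_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True),
            num_layers=4)
        self.pose_head = nn.Linear(feat_dim, num_joints * 3)
        # Stage 2: long-term encoder aggregates pose-augmented tokens for the action.
        self.pose_embed = nn.Linear(num_joints * 3, feat_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, feat_dim))  # CLS-style action token
        self.action_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True),
            num_layers=4)
        self.action_head = nn.Linear(feat_dim, num_actions)

    def forward(self, frame_feats):
        # frame_feats: (B, T, feat_dim) per-frame features of an egocentric clip.
        B, T, _ = frame_feats.shape
        pose_feats = self.pose_encoder(frame_feats)        # temporal cue for pose
        poses = self.pose_head(pose_feats)                 # (B, T, num_joints*3)
        tokens = pose_feats + self.pose_embed(poses)       # feed predicted pose back in
        cls = self.cls_token.expand(B, -1, -1)
        out = self.action_encoder(torch.cat([cls, tokens], dim=1))
        action_logits = self.action_head(out[:, 0])        # classify from the CLS token
        return poses.view(B, T, -1, 3), action_logits

In the paper the two stages operate at different temporal granularities (short-term context for pose, a longer span for action); for brevity this sketch simply runs both encoders over the full clip.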