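Artificial intelligence
Computer science
Robustness (evolution)
RGB color model
Computer vision
Pose
Pattern recognition (psychology)
Crossmodal
Transformer
Engineering
Visual perception
Perception
Biochemistry
Chemistry
Voltage
Neuroscience
Biology
Electrical engineering
Gene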
Authors
Yingying An,Dedong Yang,Mengyuan Song
Source
Journal: Measurement
[Elsevier]
Date: 2024-01-01
Volume/issue: 224: 113848-113848
Identifier
DOI:10.1016/j.measurement.2023.113848
Abstract
Visual information is usually multimodal, comprising texture and color (2D information) and spatial structure (3D information). However, multimodal 6D object pose estimation faces two problems: (1) substantial differences between RGB images and depth data; and (2) systematic noise in the depth images and a lack of contextual information in the association process. To address these problems, this paper proposes an end-to-end hierarchical feature transformer (HFT6D) containing four independent crossmodal transformer stages. The novel hierarchical feature architecture suppresses the effect of noise by modeling the spatial correspondence between the two modalities. The core module of HFT6D is bi-directional crossmodal attention, which aligns the appearance and geometric representations by recalibrating RGB-D data. In addition, HFT6D runs in real time and is robust to occluded scenes. Comprehensive experiments on two benchmark datasets show that HFT6D achieves state-of-the-art performance in terms of accuracy and speed.
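The abstract names bi-directional crossmodal attention as the core module but gives no implementation details. The following is only a minimal sketch of the general idea, not the authors' HFT6D module: it assumes single-head scaled dot-product attention without learned projections, a residual sum as the "recalibration" step, and hypothetical names (`cross_attention`, `bidirectional_cross_attention`, `rgb_tokens`, `depth_tokens`).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Each query token attends over all context tokens.

    Assumption: single head, no learned Q/K/V projections; the context
    tokens serve as both keys and values.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in context]
        weights = softmax(scores)
        attended = [sum(w * k[j] for w, k in zip(weights, context)) for j in range(d)]
        out.append(attended)
    return out


def bidirectional_cross_attention(rgb_tokens, depth_tokens):
    """Update each modality with information attended from the other.

    Assumption: the update is a plain residual sum; the paper's actual
    recalibration may differ.
    """
    rgb_from_depth = cross_attention(rgb_tokens, depth_tokens)
    depth_from_rgb = cross_attention(depth_tokens, rgb_tokens)
    rgb_out = [[a + b for a, b in zip(t, u)] for t, u in zip(rgb_tokens, rgb_from_depth)]
    depth_out = [[a + b for a, b in zip(t, u)] for t, u in zip(depth_tokens, depth_from_rgb)]
    return rgb_out, depth_out


# Toy example: two RGB tokens and one depth token, feature dimension 2.
rgb = [[1.0, 0.0], [0.0, 1.0]]
depth = [[2.0, 2.0]]
rgb_out, depth_out = bidirectional_cross_attention(rgb, depth)
```

In this toy setup each modality keeps its own token count, and every output token mixes its original features with a weighted average of the other modality's tokens, which is the sense in which the two representations are aligned.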