Authors
Yiqun Duan, Xianda Guo, Zhu Zheng, Zhen Wang, Yukai Wang, Chin-Teng Lin
Source
Journal: Cornell University - arXiv
Date: 2024-05-13
Citations: 1
Identifier
DOI: 10.48550/arxiv.2405.07573
Abstract
Current multi-modality driving frameworks normally fuse representations by applying attention between single-modality branches. However, existing networks still limit driving performance because the image and LiDAR branches are independent and lack a unified observation representation. This paper therefore proposes MaskFuser, which tokenizes the various modalities into a unified semantic feature space and provides a joint representation for subsequent behavior cloning in driving contexts. Given the unified token representation, MaskFuser is the first work to introduce cross-modality masked auto-encoder training. The masked training enhances the fused representation through reconstruction of masked tokens. Architecturally, a hybrid-fusion network is proposed to combine the advantages of both early and late fusion: in the early fusion stage, modalities are fused by performing monotonic-to-BEV translation attention between branches; late fusion is performed by tokenizing the modalities into a unified token space with shared encoding on it. MaskFuser reaches a driving score of 49.05 and route completion of 92.85% on the CARLA LongSet6 benchmark, improving on the best previous baseline by 1.74 and 3.21%, respectively. The introduced masked fusion also increases driving stability under damaged sensory inputs: MaskFuser outperforms the best previous baseline on driving score by 6.55 (27.8%), 1.53 (13.8%), and 1.57 (30.9%) at sensory masking ratios of 25%, 50%, and 75%, respectively.
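The abstract's core idea, tokenizing both modalities into one sequence and training by reconstructing randomly masked tokens, can be sketched minimally as below. This is not the paper's actual architecture: the token shapes, the concatenation-based "late fusion", and the mean-of-visible-tokens "reconstruction" are all illustrative placeholders standing in for the real encoder/decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_tokens(tokens, mask_ratio, rng):
    """Randomly split tokens into visible and masked subsets."""
    n = tokens.shape[0]
    n_mask = int(n * mask_ratio)
    idx = rng.permutation(n)
    masked_idx = idx[:n_mask]
    visible_idx = idx[n_mask:]
    return tokens[visible_idx], visible_idx, masked_idx

# Hypothetical token streams: 16 image tokens and 8 LiDAR tokens, 32-dim each.
image_tokens = rng.normal(size=(16, 32))
lidar_tokens = rng.normal(size=(8, 32))

# Unified token sequence: both modalities share one token space.
joint_tokens = np.concatenate([image_tokens, lidar_tokens], axis=0)

# Mask 50% of the joint sequence (cf. the 25/50/75% masking ratios above).
visible, vis_idx, masked_idx = mask_tokens(joint_tokens, mask_ratio=0.5, rng=rng)

# Placeholder reconstruction: predict each masked token as the mean of the
# visible tokens; a real model would decode them from a shared encoder.
prediction = np.tile(visible.mean(axis=0), (len(masked_idx), 1))
recon_loss = np.mean((prediction - joint_tokens[masked_idx]) ** 2)
```

Because the masked tokens can come from either modality, minimizing the reconstruction loss forces the shared representation to carry cross-modality information, which is the stated motivation for the masked-fusion training.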