Computer science
Point cloud
Joint (building)
Segmentation
LiDAR
Computer vision
Artificial intelligence
Point (geometry)
Remote sensing
Geology
Mathematics
Architectural engineering
Geometry
Engineering
Authors
Yue Wu, Jiaming Liu, Maoguo Gong, Qiguang Miao, Wenping Ma, Cai Xu
Identifier
DOI: 10.1016/j.inffus.2024.102370
Abstract
LiDAR and camera are two common vision sensors used in the real world, producing complementary point cloud and image data. While multimodal data has previously been used mostly in 3D detection and tracking, we aim to study large-scale semantic segmentation through multimodal data fusion rather than only knowledge transfer or distillation. We show that fusing LiDAR features with camera features and abandoning the strict point-to-pixel hard correlation can lead to better performance. Even so, it is still difficult to make full use of multimodal data due to the spatiotemporal misalignment of sensors and uneven data distribution. To address this issue, we propose Joint Semantic Segmentation (JoSS), a powerful LiDAR-camera fusion solution that employs the attention mechanism to explore the potential relationships between point clouds and images. Specifically, JoSS consists of commonly used 3D and 2D backbones, together with lightweight transformer decoders for point clouds and images. The point cloud decoder adopts queries to analyze the semantics from LiDAR features, and the image decoder adaptively fuses these queries with corresponding image features. Both exploit contextual information, thus fully mining multimodal information for semantic segmentation. In addition, we propose an effective unimodal data augmentation (UDA) method that performs cross-modal contrastive learning on point clouds and images, significantly improving accuracy by augmenting the point cloud alone and avoiding the complexity of generating paired samples of both modalities. Our JoSS achieves advanced results on two widely used large-scale benchmarks, i.e., SemanticKITTI and nuScenes-lidarseg.
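The query-based decoder pairing described in the abstract can be illustrated with a minimal sketch. The module below is an assumption-laden toy, not the authors' implementation: the class name CrossModalQueryFusion, the number of queries, the feature dimensions, and the use of standard PyTorch multi-head attention are all illustrative choices. It only shows the general idea of learnable queries first attending to LiDAR features and then being fused with image features.

```python
import torch
import torch.nn as nn


class CrossModalQueryFusion(nn.Module):
    """Hypothetical sketch of query-based LiDAR-camera fusion.

    Learnable queries first attend to point cloud (LiDAR) features, then the
    same queries attend to flattened camera features, loosely mirroring the
    point cloud decoder / image decoder pairing described in the abstract.
    """

    def __init__(self, num_queries: int = 100, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.pc_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.img_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, pc_feats: torch.Tensor, img_feats: torch.Tensor) -> torch.Tensor:
        # pc_feats:  (B, N_points, dim) features from a 3D backbone (assumed shape)
        # img_feats: (B, N_pixels, dim) flattened features from a 2D backbone (assumed shape)
        batch = pc_feats.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        # "Point cloud decoder": queries gather semantics from LiDAR features.
        q = self.norm1(q + self.pc_attn(q, pc_feats, pc_feats)[0])
        # "Image decoder": the same queries are adaptively fused with image features.
        q = self.norm2(q + self.img_attn(q, img_feats, img_feats)[0])
        return q  # (B, num_queries, dim) fused query embeddings


if __name__ == "__main__":
    model = CrossModalQueryFusion()
    pc = torch.randn(2, 4096, 256)   # toy LiDAR features
    img = torch.randn(2, 1024, 256)  # toy image features
    print(model(pc, img).shape)      # torch.Size([2, 100, 256])
```

In a full pipeline the fused query embeddings would still need a segmentation head and the cross-modal contrastive objective mentioned in the abstract; those parts are omitted here.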