Abstract
• Development of RGBX-DiffusionDet, a modular and extensible framework that demonstrates the feasibility of integrating auxiliary 2D data into DiffusionDet.
• Introduction of DCR-CBAM, a dynamic feature fusion approach.
• Introduction of DMLAB, a dynamic feature aggregation operation designed to enhance the performance of the diffusion decoding process.
• Novel regularization losses that enforce channel saliency and spatial selectivity, enabling compact and discriminative feature embeddings.
• The first use of pixel-aligned RGB-P data for object detection, including the generation of bounding box annotations, to motivate future research in multi-modal data processing.

This work addresses the challenge of object detection using multimodal heterogeneous sensors by extending the recently proposed DiffusionDet framework, which was originally designed for RGB-only detection. We propose RGBX-DiffusionDet, a generalized diffusion-based object detection framework that enables seamless fusion of heterogeneous 2D modalities (denoted as “X”, e.g., depth, infrared, and polarimetric data) with RGB imagery. The proposed approach adopts a mid-level feature fusion strategy to address the heterogeneous nature of multimodal data, which is characterized by varying spatial resolutions, noise profiles, and semantic content. Instead of the commonly used brute-force feature concatenation, we introduce two novel architectural components: (1) a dynamic channel reduction convolutional block attention module (DCR-CBAM), which enhances cross-modal fusion by emphasizing salient channel features while reducing the dimensionality of merged RGB-X features, and (2) a dynamic multi-level aggregation block (DMLAB), which addresses a limitation of the baseline DiffusionDet decoder by adaptively fusing spatial features to improve object localization. Additionally, we incorporate novel regularization losses that promote channel saliency and spatial selectivity, resulting in compact and discriminative feature embeddings.
Extensive experiments on the RGB-depth (KITTI) and RGB-infrared (M3FD) benchmarks, and on a newly annotated RGB-polarimetric (RGB-P) dataset, demonstrate consistent superiority of the proposed approach over RGB-only baselines while maintaining decoding efficiency. We further show that RGBX-DiffusionDet exhibits improved robustness and generalization capability under visually corrupted conditions, demonstrating its practical efficiency for robust multimodal object detection.