RGBX-DiffusionDet: a framework for multi-modal RGB-X object detection using DiffusionDet

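Computer science; Artificial intelligence; Discriminative; Object detection; Feature (linguistics); Pattern recognition (psychology); Computer vision; Feature extraction; Sensor fusion; Block (permutation group theory); Convolutional neural network; Dimensionality reduction; Feature learning; Modular design; Channel (broadcasting); Decoding methods; RGB color model; Embedding; Regularization (linguistics); Minimum bounding box; Object (grammar); Segmentation; Feature vector; Feature model; Cognitive neuroscience of visual object recognition; Image fusion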

Authors

Eliraz Orfaig, Inna Stainvas, Igal Bilik

Source

Journal: Pattern Recognition [Elsevier BV]
Date: 2025-09-25; Volume/issue: 172: 112460-112460; Citations: 3

Links

arxiv.org, doi.org

Identifiers

DOI: 10.1016/j.patcog.2025.112460

Abstract

• Development of RGBX-DiffusionDet, a modular and extensible framework that demonstrates the feasibility of integrating auxiliary 2D data into DiffusionDet.
• Introduction of DCR-CBAM, a dynamic feature fusion approach.
• Introduction of DMLAB, a dynamic feature aggregation operation, designed to enhance the performance of the diffusion decoding process.
• Novel regularization losses that enforce channel saliency and spatial selectivity, enabling compact and discriminative feature embeddings.
• The first use of pixel-aligned RGB-P data for object detection, including the generation of bounding box annotations, to motivate future research in multi-modal data processing.

This work addresses the challenge of object detection using multimodal heterogeneous sensors by extending the recently proposed DiffusionDet framework, initially designed for RGB-only detection. We propose RGBX-DiffusionDet, a generalized diffusion-based object detection framework that enables seamless fusion of heterogeneous 2D modalities (denoted as “X”, e.g., depth, infrared, and polarimetric data) with RGB imagery. The proposed approach adopts a mid-level feature fusion strategy to address the heterogeneous nature of multimodal data, characterized by varying spatial resolutions, noise profiles, and semantic content. Instead of commonly used brute-force feature concatenation, we introduce two novel architectural components: (1) a dynamic channel reduction convolutional block attention module (DCR-CBAM), which enhances cross-modal fusion by emphasizing salient channel features while reducing the dimensionality of merged RGB-X features, and (2) a dynamic multi-level aggregation block (DMLAB), which addresses a limitation of the baseline DiffusionDet decoder by adaptively fusing spatial features to improve object localization. Additionally, we incorporate novel regularization losses that promote channel saliency and spatial selectivity, resulting in compact and discriminative feature embeddings.
Extensive experiments on RGB-depth (KITTI), a newly annotated RGB-polarimetric (RGB-P) dataset, and RGB-infrared (M3FD) benchmarks demonstrate consistent superiority of the proposed approach over RGB-only baselines, while maintaining decoding efficiency. We further show that RGBX-DiffusionDet exhibits improved robustness and generalization capability in visually-corrupted conditions, demonstrating its practical efficiency for robust multimodal object detection.
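
The abstract describes fusing RGB and X features by emphasizing salient channels and then reducing the dimensionality of the merged features. As a rough illustration only, the following NumPy sketch applies generic CBAM-style channel attention to concatenated feature maps and then projects 2C channels down to C with a 1x1 linear map. It is not the authors' DCR-CBAM: the abstract does not specify the "dynamic" mechanism, and the function names (`channel_attention`, `fuse_rgb_x`), tensor shapes, hidden width, and random weights are all assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(feat, w1, w2):
    """Generic CBAM-style channel attention for a (C, H, W) feature map.

    w1 (hidden, C) and w2 (C, hidden) form a shared two-layer MLP.
    Returns one weight in (0, 1) per channel, shape (C,).
    """
    avg = feat.mean(axis=(1, 2))   # global average pooling -> (C,)
    mx = feat.max(axis=(1, 2))     # global max pooling -> (C,)

    def mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)  # ReLU hidden layer

    return sigmoid(mlp(avg) + mlp(mx))

def fuse_rgb_x(rgb_feat, x_feat, w1, w2, w_reduce):
    """Concatenate RGB and X features, reweight channels, reduce 2C -> C."""
    merged = np.concatenate([rgb_feat, x_feat], axis=0)   # (2C, H, W)
    weights = channel_attention(merged, w1, w2)           # (2C,)
    attended = merged * weights[:, None, None]
    # A 1x1 convolution is a linear map over the channel axis at each pixel.
    return np.einsum("oc,chw->ohw", w_reduce, attended)   # (C, H, W)

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
rgb_feat = rng.standard_normal((C, H, W))
x_feat = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((2 * C // 4, 2 * C)) * 0.1
w2 = rng.standard_normal((2 * C, 2 * C // 4)) * 0.1
w_reduce = rng.standard_normal((C, 2 * C)) * 0.1

fused = fuse_rgb_x(rgb_feat, x_feat, w1, w2, w_reduce)
print(fused.shape)  # (8, 4, 4)
```

The fused map has the same channel count as a single modality, which is one way merged features could be passed to a downstream decoder without widening it; whether the paper does this in the same way is not stated in the abstract.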
