To address the limitations of unimodal visual detection in complex scenarios involving low illumination, occlusion, and sparse texture, this paper proposes an improved dual-branch RGB-D fusion framework built on YOLOv11. The symmetric architecture processes RGB images and depth maps in parallel, with the RGB branch capturing texture semantics and the depth branch extracting geometric priors; a Dual-Encoder Cross-Attention (DECA) module performs cross-modal feature weighting, and a Dual-Encoder Feature Aggregation (DEPA) module carries out hierarchical fusion. To comprehensively validate the effectiveness and generalization capability of the proposed framework, we designed a multi-stage evaluation strategy over complementary benchmark datasets. On the M3FD dataset, the model was evaluated under both RGB-depth and RGB-infrared configurations to verify core fusion performance and extensibility to additional modalities. The VOC2007 dataset was further augmented with pseudo-depth maps generated by Depth Anything to assess adaptability under monocular input constraints. Experimental results show that our method achieves mAP50 scores of 82.59% on VOC2007 and 81.14% on M3FD in RGB-infrared mode, outperforming the YOLOv11 baseline by 5.06% and 9.15%, respectively. Notably, in the RGB-depth configuration on M3FD, the model attains an mAP50 of 77.37% with a precision of 88.91%, highlighting its robustness in geometry-aware detection tasks. Ablation studies confirm the critical roles of the Dynamic Branch Enhancement (DBE) module in adaptive feature calibration and the Dual-Encoder Attention (DEA) mechanism in multi-scale fusion, both of which significantly enhance detection stability under challenging conditions. With only 2.47M parameters, the framework offers an efficient and scalable solution for high-precision spatial perception in autonomous driving and robotics.
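As a rough illustration of the cross-modal weighting idea summarized above, the following PyTorch sketch shows how a DECA-style block might re-weight each branch with channel statistics from the other modality before fusing them. The class name, gating design, and channel sizes are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Minimal sketch of dual-branch cross-modal feature weighting (hypothetical, not the paper's code).
import torch
import torch.nn as nn


class DECAFusion(nn.Module):
    """Illustrative cross-attention fusion of an RGB and a depth feature map."""

    def __init__(self, channels: int):
        super().__init__()
        # Each gate derives channel-attention weights from the *other* modality,
        # so texture cues can re-weight geometric features and vice versa.
        self.rgb_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.depth_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Project the concatenated, re-weighted branches back to the original width.
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        # Cross-modal weighting: RGB features gated by depth statistics and vice versa.
        rgb_weighted = rgb_feat * self.depth_gate(depth_feat)
        depth_weighted = depth_feat * self.rgb_gate(rgb_feat)
        return self.fuse(torch.cat([rgb_weighted, depth_weighted], dim=1))


if __name__ == "__main__":
    rgb = torch.randn(1, 256, 80, 80)    # feature map from the RGB branch
    depth = torch.randn(1, 256, 80, 80)  # feature map from the depth branch
    fused = DECAFusion(256)(rgb, depth)
    print(fused.shape)  # torch.Size([1, 256, 80, 80])
```

In the paper's framework, a fused map like this would be produced at several backbone scales and passed on for hierarchical aggregation (the DEPA stage) before detection.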