Authors
Yuan Qin, Chanling Pan, Jinyun Chen, Ruibo Chen, Jiaxing Chen, Ruichao Qu
Abstract
Semantic segmentation of high-resolution remote sensing imagery is limited by two core challenges: scarce annotated data and weak model generalization. Large-scale pre-trained foundation models are considered key to overcoming these bottlenecks, but adapting them directly to remote sensing tasks raises three major issues: architectural mismatch, modality rigidity, and the difficulty of balancing efficiency with generalization. To address these issues, this paper proposes MM-DINO, a universal and efficient framework based on DINOv3 for unimodal and multi-modal semantic segmentation of remote sensing images. The framework follows a "Frozen Backbone-Adapter-Decoder" design. First, the pre-trained DINOv3 backbone is kept entirely frozen to preserve its general visual knowledge. Second, a Modality-Adaptive Adapter transforms sequential features into spatial pyramid features and enables early, soft cross-modal interaction via learnable weights. Finally, a Feature Enhancement and Refinement Decoder performs multi-scale context aggregation, adaptive multi-modal fusion, and progressive feature refinement. Extensive experiments on the ISPRS Vaihingen, Potsdam, and WHU-OPT-SAR datasets demonstrate the effectiveness of MM-DINO. In the unimodal setting, our method achieves mIoUs of 83.93%, 86.49%, and 55.72% on the three datasets, respectively; in the multi-modal setting, it achieves 84.32%, 86.54%, and 55.92%. All of these results outperform current state-of-the-art methods. Most notably, in zero-shot cross-dataset generalization experiments (trained on Vaihingen and tested on Potsdam), our method achieves 35.80% mIoU, substantially surpassing existing approaches and indicating strong robustness to domain shift. Furthermore, efficiency analysis indicates that the framework achieves a favorable balance between accuracy and computational cost. The code will be released at: https://github.com/KimotaQY/MM-DINO.
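To make the adapter idea concrete, the following is a minimal sketch of two operations the abstract attributes to the Modality-Adaptive Adapter: turning a flat token sequence into a spatial grid, and blending modalities softly with learnable weights. It is not the authors' implementation. The function names (`tokens_to_grid`, `soft_fuse`), the use of one scalar per token, and the softmax normalization of the weights are all assumptions made for illustration; the pyramid construction and the decoder are omitted.

```python
import math


def tokens_to_grid(tokens, height, width):
    """Reshape a flat sequence of patch tokens into a height x width grid.

    Assumption: tokens are in row-major patch order and each token is a
    single scalar (a real backbone would emit a feature vector per patch).
    """
    if len(tokens) != height * width:
        raise ValueError("token count must equal height * width")
    return [tokens[r * width:(r + 1) * width] for r in range(height)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_fuse(grids, logits):
    """Blend same-shaped per-modality grids with softmax-normalized weights.

    Assumption: `logits` stands in for the learnable weights; in training
    they would be updated by gradient descent while the backbone stays frozen.
    """
    weights = softmax(logits)
    height, width = len(grids[0]), len(grids[0][0])
    return [
        [sum(w * g[r][c] for w, g in zip(weights, grids)) for c in range(width)]
        for r in range(height)
    ]


# Hypothetical 2x2 patch grids from two modalities (e.g. optical and a second sensor).
optical = tokens_to_grid([1.0, 2.0, 3.0, 4.0], 2, 2)
auxiliary = tokens_to_grid([5.0, 6.0, 7.0, 8.0], 2, 2)

# Equal logits give equal weights, so the fused grid is the element-wise mean.
fused = soft_fuse([optical, auxiliary], logits=[0.0, 0.0])
```

Because the weights pass through a softmax, every modality always contributes a nonzero share, which is one plausible reading of "soft" interaction as opposed to hard modality selection.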