MM-DINO: DINOv3-Based Universal Framework for Uni and Multimodal Remote Sensing Image Semantic Segmentation

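Remote sensing; Computer science; Image segmentation; Segmentation; Artificial intelligence; Computer vision; Remote sensing application; Image (mathematics); Image processing; Earth observation; Pixel; Image fusion; Contextual image classification; Synthetic aperture radar; Image resolution; Radar imaging; Satellite imagery; Hyperspectral imaging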

Authors

Yuan Qin,Chanling Pan,Jinyun Chen,Ruibo Chen,Jiaxing Chen,Ruichao Qu

Source

Journal: IEEE Transactions on Geoscience and Remote Sensing [Institute of Electrical and Electronics Engineers]
Date: 2026-01-01 Volume/Issue: 64: 1-12

Identifiers

DOI: 10.1109/tgrs.2026.3677346

Abstract

Semantic segmentation of high-resolution remote sensing imagery faces two core challenges: scarce annotated data and weak model generalization. Although leveraging large-scale pre-trained foundation models is considered key to breaking through these bottlenecks, directly adapting them to remote sensing tasks still faces three major issues: architectural mismatch, modality rigidity, and the difficulty of balancing efficiency with generalization. To address these, this paper proposes MM-DINO, a universal and efficient framework based on DINOv3 for unimodal and multimodal remote sensing image semantic segmentation. The framework employs a "Frozen Backbone-Adapter-Decoder" design: first, the pre-trained DINOv3 backbone is kept entirely frozen to preserve its general visual knowledge; second, a Modality-Adaptive Adapter transforms sequential features into spatial pyramid features and enables early, soft cross-modal interaction via learnable weights; finally, a Feature Enhancement and Refinement Decoder is responsible for multi-scale context aggregation, adaptive multimodal fusion, and progressive feature refinement. Extensive experiments on the ISPRS Vaihingen, Potsdam, and WHU-OPT-SAR datasets demonstrate the effectiveness of MM-DINO. Under the unimodal setting, the method achieves mIoUs of 83.93%, 86.49%, and 55.72% on the three datasets respectively, while under the multimodal setting it achieves 84.32%, 86.54%, and 55.92%, all outperforming current state-of-the-art methods. Most notably, in zero-shot cross-dataset generalization experiments (trained on Vaihingen and tested on Potsdam), the method achieves 35.80% mIoU, significantly surpassing existing approaches and demonstrating remarkable domain robustness. Furthermore, efficiency analysis indicates that the framework achieves a favorable balance between accuracy and computational cost. The code will be released at: https://github.com/KimotaQY/MM-DINO.
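
For readers unfamiliar with the mIoU figures quoted in the abstract, the following is a minimal sketch of how mean intersection-over-union is commonly computed from flattened per-pixel class labels. It is not taken from the MM-DINO code: the function name `mean_iou`, its arguments, and the choice to average only over classes that appear in the prediction or the ground truth are assumptions made for illustration, and the paper's exact evaluation protocol (for example, which classes are excluded) may differ.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that occur in pred or target.

    pred, target: equal-length sequences of integer class ids (flattened pixels).
    num_classes: total number of classes (hypothetical parameter for this sketch).
    """
    inter = [0] * num_classes  # pixels where pred and target agree on class c
    union = [0] * num_classes  # pixels where pred or target is class c
    for p, t in zip(pred, target):
        if p == t:
            inter[p] += 1
            union[p] += 1
        else:
            # A mismatched pixel belongs to the union of both classes involved.
            union[p] += 1
            union[t] += 1
    # Skip classes that never occur, so they do not distort the average.
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
```

In this toy case the per-class IoUs are 1/3, 2/3, and 1/2, so `score` is 0.5. In practice the intersection and union counts are usually accumulated over the whole test set before dividing, rather than averaged per image.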
