Computer science
Modal verb
Remote sensing
Image sensor
Domain (mathematical analysis)
Image (mathematics)
Comprehension
Artificial intelligence
Computer vision
Geology
Mathematics
Mathematical analysis
Chemistry
Polymer chemistry
Programming language
Authors
Wei Zhang,Miaoxin Cai,Tong Zhang,Yin Zhuang,Xuerui Mao
Identifier
DOI:10.1109/tgrs.2024.3409624
Abstract
Multi-modal large language models (MLLMs) have demonstrated remarkable success in vision and vision-language tasks within the natural image domain. Owing to the significant domain gap between natural and remote sensing (RS) images, the development of MLLMs in the RS domain is still in its infancy. To fill this gap, this paper proposes EarthGPT, a pioneering MLLM that uniformly integrates various multi-sensor RS interpretation tasks for universal RS image comprehension. First, a visual-enhanced perception mechanism is constructed to refine and integrate coarse-scale semantic perception information with fine-scale detailed perception information. Second, a cross-modal mutual comprehension approach is proposed to enhance the interplay between visual perception and language comprehension, deepening the comprehension of both visual and language content. Finally, a unified multi-sensor, multi-task instruction tuning method for the RS domain is proposed to unify a wide range of tasks, including scene classification, image captioning, region-level captioning, visual question answering (VQA), visual grounding, and object detection. More importantly, a large-scale multi-sensor multi-modal RS instruction-following dataset named MMRS-1M is constructed, comprising over 1M image-text pairs drawn from 34 diverse existing RS datasets and covering multi-sensor imagery such as optical, synthetic aperture radar (SAR), and infrared. The MMRS-1M dataset addresses the lack of RS expert knowledge in MLLMs and stimulates the development of MLLMs in the RS domain. Extensive experiments demonstrate EarthGPT's superior performance on various RS visual interpretation tasks compared with other specialist models and MLLMs, proving the effectiveness of the proposed EarthGPT and offering a versatile paradigm for open-set reasoning tasks. Our code and dataset are available at https://github.com/wivizhang/EarthGPT.
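The unified instruction tuning described in the abstract casts every RS task (classification, captioning, VQA, grounding, detection) as text generation conditioned on an image and a natural-language instruction. The following is a minimal, hypothetical sketch of what one record in such a multi-sensor instruction-following dataset might look like; the field names and example values are illustrative assumptions, not the actual MMRS-1M schema.

```python
from dataclasses import dataclass

@dataclass
class RSInstructionRecord:
    """One (image, instruction, response) training example.

    Field names are illustrative, not the real MMRS-1M schema.
    """
    image_path: str   # path to the RS image
    sensor: str       # e.g. "optical", "SAR", "infrared"
    task: str         # e.g. "scene_classification", "vqa", "visual_grounding"
    instruction: str  # natural-language prompt given to the model
    response: str     # target text the model is trained to generate

# Hypothetical records spanning several of the task types listed above.
records = [
    RSInstructionRecord("img_001.png", "optical", "scene_classification",
                        "What scene category does this image belong to?",
                        "airport"),
    RSInstructionRecord("img_002.png", "SAR", "vqa",
                        "How many ships are visible in the harbor?",
                        "three"),
    RSInstructionRecord("img_003.png", "infrared", "visual_grounding",
                        "Locate the vehicle near the road.",
                        "[120, 48, 180, 96]"),
]

# Unification: every task reduces to generating `response` given the
# image plus a sensor-tagged instruction, so a single model serves all tasks.
for r in records:
    prompt = f"[{r.sensor}] {r.instruction}"
    print(prompt, "->", r.response)
```

Because all tasks share this one text-generation interface, adding a new task only requires adding records with a new instruction style, not a new model head.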