Computer science
Transformer
Grounding
Modal
Remote sensing
Scale (ratio)
Computer vision
Artificial intelligence
Voltage
Geology
Electrical engineering
Engineering
Geography
Chemistry
Cartography
Polymer chemistry
Authors
Meng Lan, Fu Rong, Hongzan Jiao, Zhi Gao, Lefei Zhang
Identifier
DOI: 10.1109/TGRS.2024.3407598
Abstract
Visual grounding for remote sensing images (RSVG) aims to localize the objects referred to by a language expression in remote sensing (RS) images. Existing methods tend to align visual and text features, concatenate them, and then employ a fusion Transformer to learn a token representation for final target localization. However, this simple fusion Transformer structure fails to sufficiently learn the location representation of the referred object from the multi-modal features. Inspired by the detection Transformer, in this paper we propose a novel language-query-based Transformer framework for RSVG, termed LQVG. Specifically, we adopt the extracted sentence-level text features as queries, called language queries, to retrieve and aggregate representation information about the referred object from the multi-scale visual features in the Transformer decoder. The language queries are then converted into object embeddings for the final coordinate prediction of the referred object. In addition, a multi-scale cross-modal alignment module is devised before the multi-modal Transformer to enhance the semantic correlation between the visual and text features, thus helping the cross-modal decoding process generate a more precise object representation. Moreover, a new RSVG dataset named RSVG-HR is built to evaluate the performance of RSVG approaches on very high-resolution remote sensing images with inconspicuous objects. Experimental results on two benchmark datasets demonstrate that the proposed method significantly surpasses the compared methods and achieves state-of-the-art performance. The dataset and code are available at https://github.com/LANMNG/LQVG.
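To make the language-query decoding idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation (see the linked repository for that): a sentence-level text embedding serves as the single decoder query attending over flattened multi-scale visual features, and the decoded object embedding is mapped to box coordinates. The module names, dimensions, box head, and the omission of the multi-scale cross-modal alignment module are all assumptions made for illustration.

import torch
import torch.nn as nn

class LanguageQueryDecoder(nn.Module):
    """Sketch: a sentence-level text feature acts as the 'language query'
    that retrieves object information from multi-scale visual features."""
    def __init__(self, d_model=256, nhead=8, num_layers=6):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        # Assumed prediction head: object embedding -> (cx, cy, w, h) in [0, 1].
        self.box_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4), nn.Sigmoid(),
        )

    def forward(self, sentence_feat, visual_feats):
        # sentence_feat: (B, d_model) sentence-level text embedding.
        # visual_feats: list of (B, H_i * W_i, d_model) feature maps,
        # assumed already projected to d_model and flattened per scale.
        memory = torch.cat(visual_feats, dim=1)      # (B, sum HW, d)
        query = sentence_feat.unsqueeze(1)           # (B, 1, d) language query
        obj_embed = self.decoder(query, memory)      # (B, 1, d) object embedding
        return self.box_head(obj_embed.squeeze(1))   # (B, 4) predicted box

# Toy usage with random tensors standing in for backbone outputs.
dec = LanguageQueryDecoder()
txt = torch.randn(2, 256)                            # hypothetical text encoder output
vis = [torch.randn(2, n, 256) for n in (400, 100, 25)]  # three feature scales
print(dec(txt, vis).shape)                           # torch.Size([2, 4])

Using the sentence embedding itself as the query, rather than learned object queries as in the detection Transformer, is what ties the decoded embedding to the referred object; a real implementation would add cross-modal alignment before decoding and a box regression loss during training.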