Computer science
Context (archaeology)
Artificial intelligence
Computer vision
Human-computer interaction
Geography
Archaeology
Authors
Jian Sun, Junlang Huang, Xinyu Jiang, Yimin Zhou, Chi-Man Vong
Identifier
DOI:10.1109/tcsvt.2025.3604002
Abstract
Cross-view geo-localization, which aims to establish correlation between images of the same geographic area collected by unmanned aerial vehicle (UAV) and satellite platforms, is essential for drone visual localization and navigation. Drastic changes in the drone's viewpoint pose a significant challenge for methods based on image representation mining. Previous studies attempt to learn fine-grained image appearance features from various perspectives; however, they tend to underutilize the UAV's state information. This paper proposes a novel multimodal framework, CGSI (Context-Guided and UAV's Status Informed), which leverages textual descriptions of the UAV state to mitigate the scene bias caused by viewpoint differences. Two issues are addressed to achieve more accurate and reliable multimodal geo-localization: 1) Fixed UAV altitudes cause a domain gap across datasets. To address this, we propose a Context-Guided Multimodal Tokenizer, which learns contextual vectors from multi-altitude visual features and uses them as adaptive text tokens. 2) Multimodal features are susceptible to state-feature ambiguity. To address this, we propose a Drone Group Graph Attention method, which strengthens the association between UAV visual features that share a location ID but differ in state, and exploits these intrinsic relationships to extract discriminative multimodal features. Extensive experiments on the University-1652 and SUES benchmarks demonstrate that CGSI significantly outperforms existing algorithms, achieving state-of-the-art performance. The substantial improvements observed in cross-region ablation experiments further demonstrate the strong domain generalization capability of the method.
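The abstract does not give the details of the Drone Group Graph Attention method, so the following is only a minimal sketch of the general idea it names: attention restricted to features that share a location ID. The function name `group_attention`, the scaled dot-product scoring, and the toy feature vectors are all assumptions made for illustration and are not taken from the paper.

```python
import math


def group_attention(features, location_ids):
    """Softmax attention over feature vectors, restricted to vectors that
    share a location ID. Hypothetical illustration, not the paper's method.

    Returns (outputs, weights), where weights[i][j] is how much vector j
    contributes to the aggregated output for vector i.
    """
    n = len(features)
    dim = len(features[0])
    scale = math.sqrt(dim)
    outputs, weights = [], []
    for i in range(n):
        # Neighbours of i: every vector with the same location ID (i included).
        group = [j for j in range(n) if location_ids[j] == location_ids[i]]
        # Scaled dot-product score between vector i and each group member.
        scores = [
            sum(a * b for a, b in zip(features[i], features[j])) / scale
            for j in group
        ]
        # Numerically stable softmax over the group only.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        row = [0.0] * n
        for j, e in zip(group, exps):
            row[j] = e / total
        weights.append(row)
        # Weighted sum of the group's features.
        outputs.append(
            [sum(row[j] * features[j][k] for j in range(n)) for k in range(dim)]
        )
    return outputs, weights


# Toy data (assumed): five UAV-view features, two locations.
features = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
location_ids = [0, 0, 1, 1, 1]
outputs, weights = group_attention(features, location_ids)
```

In this sketch, features from different locations receive zero attention weight, so each output mixes only views of the same place taken under different UAV states. A learned version would typically add trainable projections before scoring, which is omitted here.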