Computer science
Leverage (statistics)
Salience
Image retrieval
Modality
Artificial intelligence
Computer vision
Remote sensing
Information retrieval
Image (mathematics)
Geology
Chemistry
Polymer chemistry
Authors
Jinghao Huang, Yaxiong Chen, Shengwu Xiong, Xiaoqiang Lu
Identifier
DOI:10.1109/tgrs.2023.3264006
Abstract
Cross-modal remote sensing image-audio retrieval aims to use audio or remote sensing images as queries to retrieve relevant remote sensing images or corresponding audios. Although many approaches leverage labeled samples to achieve good performance, labeled samples are costly to obtain, because annotating cross-modal remote sensing data usually requires substantial labor. Therefore, unsupervised cross-modal learning is very important in real-world applications. In this paper, we propose a novel unsupervised cross-modal remote sensing image-audio retrieval approach, named Self-Supervision Interactive Alignment (SSIA), which takes advantage of large amounts of unlabeled samples to learn salient information, cross-modal alignment, and the similarity between remote sensing images and audios. Since self-supervised learning lacks the supervision of label information, we use the similarity between the input remote sensing image and audio as the supervision signal. In addition, to perform cross-modal alignment, a novel interactive alignment module is designed to explore fine-grained correspondences between remote sensing images and audios. Moreover, we design an audio-guided image de-redundancy module that reduces redundant visual information and captures the salient information of remote sensing images. Extensive experiments on four widely used remote sensing image-audio datasets show that SSIA achieves better remote sensing image-audio retrieval performance than the compared approaches.
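To make the abstract's main ideas concrete, the following is a minimal, illustrative sketch (not the authors' SSIA implementation) of a two-branch image-audio model: a cross-attention step stands in for the interactive alignment module, a simple audio-guided gate stands in for the de-redundancy module, and an InfoNCE-style contrastive loss uses only matched image-audio pairs as supervision, mirroring the label-free setup. All encoder choices, dimensions, and module names here are assumptions for illustration.

```python
# Hypothetical sketch of the ideas described above; NOT the authors' SSIA code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalRetrievalSketch(nn.Module):
    def __init__(self, img_dim=2048, aud_dim=512, embed_dim=256, n_heads=4):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)   # project image region features
        self.aud_proj = nn.Linear(aud_dim, embed_dim)   # project audio frame features
        # cross-attention: audio frames attend to image regions ("interactive alignment" stand-in)
        self.align = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
        # audio-guided gate that down-weights redundant image regions ("de-redundancy" stand-in)
        self.gate = nn.Sequential(nn.Linear(embed_dim * 2, 1), nn.Sigmoid())

    def forward(self, img_regions, aud_frames):
        # img_regions: (B, R, img_dim), aud_frames: (B, T, aud_dim)
        v = self.img_proj(img_regions)                   # (B, R, D)
        a = self.aud_proj(aud_frames)                    # (B, T, D)
        # alignment: each audio frame gathers the image regions it corresponds to
        aligned, _ = self.align(query=a, key=v, value=v)             # (B, T, D)
        # gate each image region by the pooled audio context
        aud_ctx = a.mean(dim=1, keepdim=True).expand(-1, v.size(1), -1)
        g = self.gate(torch.cat([v, aud_ctx], dim=-1))                # (B, R, 1)
        v = v * g
        # pool to one embedding per modality and L2-normalize for retrieval
        img_emb = F.normalize(v.mean(dim=1), dim=-1)
        aud_emb = F.normalize(aligned.mean(dim=1), dim=-1)
        return img_emb, aud_emb


def contrastive_loss(img_emb, aud_emb, temperature=0.07):
    # InfoNCE-style loss: matched image-audio pairs in the batch are positives,
    # all other pairings are negatives, so no class labels are needed.
    logits = img_emb @ aud_emb.t() / temperature
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    model = CrossModalRetrievalSketch()
    imgs = torch.randn(8, 36, 2048)   # 8 images, 36 region features each
    auds = torch.randn(8, 50, 512)    # 8 audio clips, 50 frame features each
    img_emb, aud_emb = model(imgs, auds)
    print(contrastive_loss(img_emb, aud_emb).item())
```

At retrieval time, the two normalized embeddings can be compared by dot product, so either modality can query the other by ranking similarity scores.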