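Artificial intelligence
Computer vision
Computer science
Object detection
Remote sensing
Detector
Pattern recognition (psychology)
Geology
Telecommunications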
Authors
Jiaojiao Li,Penghao Tian,Rui Song,Haitao Xu,Yunsong Li,Qian Du
Identifier
DOI:10.1109/tgrs.2024.3360456
Abstract
Remote sensing object detection (RSOD) is a fundamental and valuable task in Earth monitoring. However, remote sensing images (RSIs) are typically acquired from a bird’s-eye perspective, which gives them intrinsic properties such as complex backgrounds, random and dense object distributions, and multiscale objects. These properties prevent detection methods that perform well on natural images (NIs) from being applied directly to RSIs, limiting the performance they can achieve. To address this, we propose a pyramid convolutional vision transformer (PCViT) that overcomes the limitations of existing transformer methods. First, we employ a pyramid architecture to capture the multiscale information present in RSIs. To enhance the feature extraction capabilities of the transformer, we introduce a parallel convolution module (PCM) that supplies local information the transformer may miss. Furthermore, we propose a self-supervised pretraining strategy called multi-perspective pretraining (MPP) to pretrain the model, which is subsequently finetuned on the downstream detection task. During the finetuning stage, we introduce a local/global k-NN attention (LGKA) to improve how relationships between tokens are established. In the neck, we propose a feature-reflowing pyramid network (FRPN) to facilitate contextual information interaction and further enhance PCViT’s ability to process multiscale information. Experimental results on two representative datasets, NWPU VHR-10 and DIOR, demonstrate the effectiveness of PCViT and highlight its suitability for RSOD tasks.
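The abstract does not describe how LGKA is built, so the following is only a minimal sketch of the generic k-NN attention idea it is named after: each query attends to its `top_k` highest-scoring keys instead of all of them. The function name `knn_attention`, its parameters, and the toy vectors are assumptions made for illustration; this is not the authors' implementation and it omits the local/global split entirely.

```python
import math


def knn_attention(queries, keys, values, top_k):
    """Generic k-NN attention (illustrative, not the paper's LGKA).

    Each query attends only to the top_k keys with the highest
    scaled dot-product score; all other keys get zero weight.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    value_dim = len(values[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score of this query against every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        # Indices of the top_k highest-scoring keys.
        keep = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)[:top_k]
        # Softmax restricted to the kept keys (max subtracted for stability).
        peak = max(scores[j] for j in keep)
        weights = {j: math.exp(scores[j] - peak) for j in keep}
        total = sum(weights.values())
        # Weighted sum of the kept value vectors.
        outputs.append(
            [sum(weights[j] / total * values[j][d] for j in keep) for d in range(value_dim)]
        )
    return outputs


# Hypothetical toy tokens, chosen only to make the behaviour visible.
queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

print(knn_attention(queries, keys, values, top_k=1))  # attends to the best key only
print(knn_attention(queries, keys, values, top_k=3))  # same as ordinary softmax attention
```

With `top_k` equal to the number of keys the function reduces to ordinary softmax attention, while smaller values discard low-scoring tokens, which is the usual motivation for k-NN attention when many tokens are background.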