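Computer science
Feature (linguistics)
Artificial intelligence
Semantics (computer science)
Segmentation
Geospatial analysis
Remote sensing
Image segmentation
Feature extraction
Image fusion
Computer vision
Remote sensing application
Pattern recognition (psychology)
Sensor fusion
Fusion
Feature vector
Semantic feature
Image (mathematics)
Visualization
Image texture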
Authors
Jiayuan Li, Zhen Wang, Xiao Fei Sun, Yiming Yao, Nan Xu, Zhu‐Hong You, Huang De-Shuang
Identifier
DOI:10.1109/tgrs.2026.3666675
Abstract
Referring remote sensing image segmentation (RSRIS) aims to achieve target-oriented, fine-grained understanding of geospatial information by leveraging both visual and linguistic modalities. Unlike traditional remote sensing semantic segmentation, RSRIS must handle the more complex contextual relationships and pronounced scale variations inherent to remote sensing imagery, which make it difficult to align and fuse textual semantics with visual features precisely. To tackle these issues, we propose VSPNet, a novel framework for vision-language guided remote sensing referring segmentation. Specifically, VSPNet adopts a hybrid backbone architecture based on CNN, Transformer, and Mamba models to extract rich, multi-dimensional visual representations, thereby enabling more effective cross-modal interaction. Furthermore, we design a hierarchical multimodal feature fusion strategy tailored for RSRIS: at the shallow feature level, a Text-guided Texture Interaction Module (TTIM) enhances the integration of fine-grained texture details with textual cues; at the deep feature level, a Text-guided Semantic Fusion Module (TSFM) facilitates global contextual alignment between segmentation targets and the semantic expressions in language. Extensive experiments on the public benchmarks RefSegRS and RRSIS-D demonstrate that VSPNet consistently outperforms state-of-the-art methods. Comprehensive ablation studies further verify the necessity and effectiveness of each proposed component. The code is available at https://github.com/NWPUFranklee/VSPNet.git.
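The abstract does not describe the internals of TTIM or TSFM, so the following is only a generic illustration of what "text-guided" fusion can look like, not the authors' implementation. It sketches scaled dot-product cross-attention in plain Python, where each visual feature vector attends over the text token vectors and the resulting text context is added back as a residual. The function names (`softmax`, `dot`, `text_guided_fusion`), the residual connection, and the shared feature dimension are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def text_guided_fusion(visual, text):
    """Hypothetical cross-attention fusion (not the VSPNet modules).

    visual: list of N visual feature vectors, each of dimension d
    text:   list of T text token vectors, each of dimension d
    Returns N fused vectors: visual feature + attended text context.
    """
    d = len(text[0])
    scale = math.sqrt(d)
    fused = []
    for v in visual:
        # Attention weights of this visual location over all text tokens.
        weights = softmax([dot(v, t) / scale for t in text])
        # Weighted sum of text tokens: the text context for this location.
        context = [sum(w * t[i] for w, t in zip(weights, text)) for i in range(d)]
        # Residual connection keeps the original visual information.
        fused.append([vi + ci for vi, ci in zip(v, context)])
    return fused


if __name__ == "__main__":
    # Two visual locations and two text tokens, all of dimension 2.
    visual_feats = [[1.0, 0.0], [0.0, 1.0]]
    text_tokens = [[2.0, 0.0], [0.0, 2.0]]
    for row in text_guided_fusion(visual_feats, text_tokens):
        print(row)
```

In a real model the queries, keys, and values would pass through learned projections and the operation would run on tensors across many attention heads; this sketch omits all of that to show only the data flow from text to visual features.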