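Computer science
Granularity
Modality (human-computer interaction)
Set (abstract data type)
Object detection
Artificial intelligence
Key (lock)
Task (project management)
Remote sensing
Bridge (graph theory)
Training set
Object (grammar)
Bridging (networking)
Computer vision
Natural language processing
Task analysis
Semantics (computer science)
Cognitive neuroscience of visual object recognition
Coupling (piping)
Natural language
Simulated annealing
Structured prediction
Semantic mapping
Mode (computer interface)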
Authors
Yuxuan Li, Yuming Chen, Yunheng Li, Ming-Ming Cheng, Xiang Li, Jian Yang
Source
Journal: Cornell University - arXiv
Date: 2026-03-02
Identifier
DOI: 10.48550/arxiv.2603.01758
Abstract
Heterogeneous multi-modal remote sensing object detection aims to accurately detect objects from diverse sensors (e.g., RGB, SAR, Infrared). Existing approaches largely adopt a late alignment paradigm, in which modality alignment and task-specific optimization are entangled during downstream fine-tuning. This tight coupling complicates optimization and often results in unstable training and suboptimal generalization. To address these limitations, we propose BabelRS, a unified language-pivoted pretraining framework that explicitly decouples modality alignment from downstream task learning. BabelRS comprises two key components: Concept-Shared Instruction Aligning (CSIA) and Layerwise Visual-Semantic Annealing (LVSA). CSIA aligns each sensor modality to a shared set of linguistic concepts, using language as a semantic pivot to bridge heterogeneous visual representations. To further mitigate the granularity mismatch between high-level language representations and dense detection objectives, LVSA progressively aggregates multi-scale visual features to provide fine-grained semantic guidance. Extensive experiments demonstrate that BabelRS stabilizes training and consistently outperforms state-of-the-art methods without bells and whistles. Code: https://github.com/zcablii/SM3Det.