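Computer science
Robustness (evolution)
Artificial intelligence
Computer vision
Eye movement
Language model
Context model
Visualization
Semantics (computer science)
Context (archaeology)
Transformer
Active appearance model
Tracking (education)
Encoding (memory)
Video tracking
Language understanding
Encoding (set theory)
Feature extraction
Machine learning
Tracking system
Image segmentation
Leverage (statistics)
Sensory cue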
Authors
Jie Zhao,Xin Chen,Shengming Li,Chunjuan Bo,Dong Wang,Huchuan Lu
Identifier
DOI:10.1109/tip.2025.3635016
Abstract
Due to the substantial gap between vision and language modalities, along with the mismatch between fixed language descriptions and dynamic visual information, existing vision-language tracking methods perform on par with or slightly worse than vision-only tracking. Effectively exploiting the rich semantics of language to enhance tracking robustness remains an open challenge. To address these issues, we propose a self-adaptive vision-language tracking framework that leverages the pre-trained multi-modal CLIP model to obtain well-aligned vision-language representations. A novel context-aware prompting mechanism is introduced to dynamically adapt linguistic cues based on the evolving visual context during tracking. Specifically, our context prompter extracts dynamic visual features from the current search image and integrates them into the text encoding process, enabling self-updating language embeddings. Furthermore, our framework employs a unified one-stream Transformer architecture, supporting joint training for both vision-only and vision-language tracking scenarios. Our method not only bridges the modality gap but also enhances robustness by allowing language features to evolve with visual context. Extensive experiments on four vision-language tracking benchmarks demonstrate that our method effectively leverages the advantages of language to enhance visual tracking. Our large model achieves 55.0% AUC on $\text{LaSOT}_{\text{EXT}}$ and 69.0% AUC on TNL2K. Additionally, our language-only tracking model achieves performance comparable to that of state-of-the-art vision-only tracking methods on TNL2K. Code is available at https://github.com/zj5559/SAVLT.
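The context-aware prompting idea in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: it assumes, purely for illustration, that the prompter mean-pools the search-image tokens, projects the result with one linear layer, and adds it to a set of learnable prompt vectors that are prepended to the fixed word embeddings. All function names (`mean_pool`, `linear`, `context_prompt`, `build_text_input`) and the toy values are hypothetical, and the CLIP text encoder that would consume the resulting sequence is omitted.

```python
def mean_pool(tokens):
    """Average a list of equal-length feature vectors into one vector."""
    dim = len(tokens[0])
    return [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]


def linear(vec, weight):
    """Apply a weight matrix stored as a list of output rows to a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def context_prompt(search_tokens, base_prompts, weight):
    """Shift each base prompt by a projection of the pooled visual context.

    Assumed stand-in for a context prompter; the paper's module may differ.
    """
    shift = linear(mean_pool(search_tokens), weight)
    return [[p + s for p, s in zip(prompt, shift)] for prompt in base_prompts]


def build_text_input(search_tokens, base_prompts, word_embeddings, weight):
    """Prepend visually conditioned prompts to the fixed word embeddings."""
    prompts = context_prompt(search_tokens, base_prompts, weight)
    return prompts + word_embeddings


# Toy 2-D example: the description stays fixed, the search frame changes.
weight = [[1.0, 0.0], [0.0, 1.0]]          # identity projection (assumption)
base_prompts = [[0.0, 0.0], [1.0, 1.0]]    # stand-ins for learnable prompts
word_embeddings = [[2.0, 3.0]]             # stand-in for the fixed description
frame_a = [[1.0, 0.0], [0.0, 1.0]]         # search-image tokens at time t
frame_b = [[1.0, 0.0], [1.0, 0.0]]         # search-image tokens at time t+1

text_a = build_text_input(frame_a, base_prompts, word_embeddings, weight)
text_b = build_text_input(frame_b, base_prompts, word_embeddings, weight)
```

In this sketch the word embeddings are identical for both frames, while the prompt vectors differ because they depend on the pooled search features. That is the sense in which the language input can update itself as the visual context changes.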