Self-Adaptive Vision-Language Tracking With Context Prompting

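Keywords: Computer science; Robustness (evolution); Artificial intelligence; Computer vision; Eye tracking; Language model; Context model; Visualization; Semantics (computer science); Context (archaeology); Transformer; Active appearance model; Tracking (education); Encoding (memory); Video tracking; Language understanding; Encoding (set theory); Feature extraction; Machine learning; Tracking system; Image segmentation; Leverage (statistics); Sensory cue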

Authors

Jie Zhao, Xin Chen, Shengming Li, Chunjuan Bo, Dong Wang, Huchuan Lu

Source

Journal: IEEE Transactions on Image Processing [Institute of Electrical and Electronics Engineers]
Date: 2025-01-01 Volume: 34, pages 8046-8058

Links

nih.gov, doi.org

Identifiers

DOI: 10.1109/tip.2025.3635016

Abstract

Due to the substantial gap between vision and language modalities, along with the mismatch problem between fixed language descriptions and dynamic visual information, existing vision-language tracking methods exhibit performance on par with or slightly worse than vision-only tracking. Effectively exploiting the rich semantics of language to enhance tracking robustness remains an open challenge. To address these issues, we propose a self-adaptive vision-language tracking framework that leverages the pre-trained multi-modal CLIP model to obtain well-aligned visual-language representations. A novel context-aware prompting mechanism is introduced to dynamically adapt linguistic cues based on the evolving visual context during tracking. Specifically, our context prompter extracts dynamic visual features from the current search image and integrates them into the text encoding process, enabling self-updating language embeddings. Furthermore, our framework employs a unified one-stream Transformer architecture, supporting joint training for both vision-only and vision-language tracking scenarios. Our method not only bridges the modality gap but also enhances robustness by allowing language features to evolve with visual context. Extensive experiments on four vision-language tracking benchmarks demonstrate that our method effectively leverages the advantages of language to enhance visual tracking. Our large model can obtain 55.0% AUC on $\text{LaSOT}_{\text{EXT}}$ and 69.0% AUC on TNL2K. Additionally, our language-only tracking model achieves performance comparable to that of state-of-the-art vision-only tracking methods on TNL2K. Code is available at https://github.com/zj5559/SAVLT.
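The context-prompting idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see the linked repository for that). It only assumes one common way of conditioning text prompts on an image: pool the search-image features, project them into the text embedding space, and add the result to each prompt token before text encoding. All names here (`context_prompt`, `mean_pool`, `projection`, the dimensions) are hypothetical.

```python
import random


def make_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights (illustrative only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def mean_pool(patch_features):
    """Average per-patch features into one summary vector for the frame."""
    n = len(patch_features)
    return [sum(column) / n for column in zip(*patch_features)]


def context_prompt(prompt_tokens, search_patches, projection):
    """Shift every prompt token by a bias derived from the current search image.

    Assumption: additive conditioning; the paper's prompter may differ.
    """
    visual_summary = mean_pool(search_patches)   # length: visual_dim
    bias = matvec(projection, visual_summary)    # length: text_dim
    return [[t + b for t, b in zip(token, bias)] for token in prompt_tokens]


# Hypothetical sizes: 4-d visual features, 3-d text tokens, 2 prompt tokens.
rng = random.Random(0)
projection = make_matrix(3, 4, rng)
prompt_tokens = make_matrix(2, 3, rng)

# Two frames with different visual content give different prompt tokens,
# so the language embedding fed to the text encoder changes over time.
frame_a = [[1.0, 0.0, 0.5, 0.2], [0.8, 0.1, 0.4, 0.3]]
frame_b = [[0.0, 1.0, 0.1, 0.9], [0.2, 0.9, 0.0, 0.7]]
prompts_a = context_prompt(prompt_tokens, frame_a, projection)
prompts_b = context_prompt(prompt_tokens, frame_b, projection)
```

In a real tracker the adapted tokens would be passed through a text encoder such as CLIP's, and the projection would be learned jointly with the tracker; both steps are omitted here.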
