粒度
计算机科学
语义学(计算机科学)
师(数学)
情报检索
人工智能
计算机视觉
自然语言处理
程序设计语言
数学
算术
作者
Zhimin Wei,Z. Y. Zhang,Peng Wu,Ji Wang,Peng Wang,Yanning Zhang
标识
DOI:10.1109/tcsvt.2024.3392831
摘要
Text-based Person Retrieval aims to search the target pedestrian image from video surveillance or a large image database with a text description. Previous works have recognized the significance of mining local information in images and descriptions and performing fine-grained alignment. These approaches adopt hard division or auxiliary networks for locating local visual regions. However, the two existing ways are not flexible enough for various images and may even bring noise. Meanwhile, the Vision-Language Pre-training models like CLIP exhibit strong generalization and zero-shot abilities, which provide an available way to this issue. In this paper, we propose a novel Fine-Granularity Alignment model with Semantics-Centric Visual Division (SCVD). Our method contains a Semantics Deconstructor (SD), a Cross-modal Guided Interaction (CGI) module, and a Dynamic Focus Alignment (DFA) module. The SD aims to extract fine-grained semantic prompts from the raw description which is easy-understand for CLIP. In CGI, we propose a Text-Guided Visual Localization (TVL) module to generate local visual representations according to the semantic prompts and a Vision-Guided Semantics Reconstruction (VSR) module to integrate the prompts into the textual representation. The DFA is used finally to align vision-text fine-grained information. The extensive experiments demonstrate that our proposed framework significantly outperforms current state-of-the-art methods in terms of Rank@1 metric on three benchmarks by an absolute gain of 6.56%, 8.93%, and 11.53%, respectively. Our code is available in https://github.com/tujun233/SCVD.git.
科研通智能强力驱动
Strongly Powered by AbleSci AI