Authors
Min Cao, Xinyu Zhou, Ding Jiang, Bo Du, Mang Ye, Min Zhang
Identifier
DOI: 10.1109/tpami.2025.3620139
Abstract
Text-to-image person retrieval (TIPR) aims to identify a target person from a textual description and faces the challenge of modality heterogeneity. Prior works have attempted to address it through cross-modal global or local alignment strategies. However, global methods typically overlook fine-grained cross-modal differences, whereas local methods require prior information to establish explicit part alignments. Additionally, current methods are English-centric, which restricts their application in multilingual contexts. To alleviate these issues, we pioneer a multilingual TIPR task by developing a multilingual TIPR benchmark, for which we leverage large language models for initial translations and refine them by integrating domain-specific knowledge. Correspondingly, we propose Bi-IRRA, a Bidirectional Implicit Relation Reasoning and Aligning framework, to learn alignment across languages and modalities. Within Bi-IRRA, a bidirectional implicit relation reasoning module enables bidirectional prediction of masked image and text, implicitly enhancing the modeling of local relations across languages and modalities, and a multi-dimensional global alignment module is integrated to bridge the modality heterogeneity. The proposed method achieves new state-of-the-art results on all multilingual TIPR datasets. Data and code are available at https://github.com/Flame-Chasers/Bi-IRRA.
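The authors' actual implementation is in the repository linked above. Purely as an illustration of the masked-prediction idea the abstract describes (here only one direction: predicting a masked text token conditioned on image features via cross-attention), a minimal NumPy sketch could look as follows. All function names, shapes, and the single-head attention design are hypothetical simplifications, not taken from Bi-IRRA.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (toy version)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)      # (n_text, n_patches)
    return softmax(scores, axis=-1) @ values    # (n_text, d)

def masked_text_prediction(text_tokens, image_feats, w_vocab, mask_idx):
    """Mask one text position, fuse the text with image features, and
    return a vocabulary distribution for the masked slot (the training
    target would be the original token, via cross-entropy)."""
    tokens = text_tokens.copy()
    tokens[mask_idx] = 0.0                      # zero out the masked slot
    fused = cross_attention(tokens, image_feats, image_feats)
    logits = fused[mask_idx] @ w_vocab          # (vocab_size,)
    return softmax(logits)

# hypothetical toy shapes: 5 text tokens, 7 image patches, dim 16, vocab 100
text = rng.normal(size=(5, 16))
image = rng.normal(size=(7, 16))
w_vocab = rng.normal(size=(16, 100))
probs = masked_text_prediction(text, image, w_vocab, mask_idx=2)
print(probs.shape)
```

The "bidirectional" aspect of the paper would additionally run the symmetric direction (predicting masked image patches conditioned on text), which this sketch omits.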