Enhance Composed Image Retrieval via Multi-Level Collaborative Localization and Semantic Activeness Perception

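Computer science; Semantics (computer science); Task (project management); Image (mathematics); Information retrieval; Embedding; Image retrieval; Modality (human-computer interaction); Artificial intelligence; Dimension (graph theory); Semantic gap; Fuse (electrical); Pattern recognition (psychology); Computer vision; Mathematics; Management; Pure mathematics; Economics; Programming language; Electrical engineering; Engineering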

Authors

Gangjian Zhang, Shikui Wei, Huaxin Pang, Shuang Qiu, Yao Zhao

Source

Journal: IEEE Transactions on Multimedia [Institute of Electrical and Electronics Engineers]
Date: 2023-05-05 Volume/issue: 26: 916-928 Citations: 2

Identifiers

DOI: 10.1109/tmm.2023.3273466

Abstract

Composed image retrieval (CIR) is an emerging and challenging research task that combines two modalities, a reference image and a modification text, into one query to retrieve the target image. In online shopping scenarios, the user would use the modification text as feedback to describe the difference between the reference image and the desired image. To handle the task, two main problems must be addressed. One is the localization problem: how to precisely find the spatial areas of the image mentioned by the text. The other is the modification problem: how to effectively modify the image semantics based on the text. However, existing methods merely fuse information from the two modalities coarsely, while the accurate spatial and semantic correspondence between these two heterogeneous features tends to be neglected. As a result, image details cannot be precisely located and modified. To this end, we consider integrating information from the two modalities more accurately in both spatial and semantic aspects. We propose an end-to-end framework for the CIR task that contains three key components: the Multi-level Collaborative Localization module (MCL), the Differential Semantics Discrimination module (DSD), and the Image Difference Enhancement constraints (IDE). Specifically, to solve the localization problem, MCL precisely grounds the text in the corresponding image areas by collaboratively using text positioning information on multiple image layers. For the modification problem, DSD builds a distribution to evaluate how likely each image semantic dimension is to be modified, and IDE effectively learns the modification patterns of the text against the image embedding based on that distribution. Extensive experiments on three datasets show that the proposed method achieves outstanding performance compared with state-of-the-art (SOTA) methods.
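To make the task setting concrete, the following is a minimal sketch of composed retrieval with a per-dimension gate. It is not the paper's MCL, DSD, or IDE implementation: the function names (`compose_query`, `rank_candidates`), the three-dimensional toy embeddings, and the hand-set `gate` values are all assumptions made for illustration. The gate loosely mirrors the idea of scoring how likely each semantic dimension is to be modified by the text; in a real system it would be learned rather than fixed by hand.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def compose_query(image_emb, text_emb, gate):
    """Hypothetical gated fusion: gate[i] near 1 means dimension i is
    taken from the text, gate[i] near 0 means it is kept from the image."""
    return [(1.0 - g) * x + g * t for x, t, g in zip(image_emb, text_emb, gate)]


def rank_candidates(query, candidates):
    """Sort candidate images by similarity to the composed query, best first."""
    scored = [(name, cosine(query, emb)) for name, emb in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (assumed): dim 0 = colour, dim 1 = long sleeves, dim 2 = collar.
reference_image = [1.0, 0.0, 1.0]
modification_text = [0.0, 1.0, 0.0]   # "make the sleeves longer"
gate = [0.0, 1.0, 0.0]                # only the sleeve dimension is modified

query = compose_query(reference_image, modification_text, gate)

candidates = {
    "same_as_reference": [1.0, 0.0, 1.0],
    "desired_target": [1.0, 1.0, 1.0],
    "unrelated": [0.0, 1.0, 0.0],
}
ranking = rank_candidates(query, candidates)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy case the candidate that keeps the reference's colour and collar but adds long sleeves ranks first, ahead of the unmodified reference image, which is the behaviour a composed query is meant to produce.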
