LLM-Enhanced Composed Image Retrieval: An Intent Uncertainty-Aware Linguistic-Visual Dual Channel Matching Model

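Computer science; Embedding; Matching (statistics); Artificial intelligence; Encoder; Information retrieval; Modal verb; Natural language processing; Computer vision; Mathematics; Statistics; Operating system; Chemistry; Polymer chemistry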

Authors

Hongfei Ge, Yuanchun Jiang, Jianshan Sun, Kun Yuan, Yezheng Liu

Links

acm.org, doi.org

Abstract

Composed image retrieval (CoIR) takes a multi-modal query consisting of a reference image and a modification text describing the desired changes, allowing users to express image retrieval intents flexibly and effectively. The key to CoIR lies in how to properly reason about the search intent from the multi-modal query. Existing work either aligns the composite embedding of the multi-modal query with the target image embedding in the visual domain through late fusion, or converts all images into text descriptions and leverages large language models (LLMs) for textual semantic reasoning. However, such single-modality reasoning fails to capture users’ ambiguous and uncertain intents in multi-modal queries comprehensively and interpretably, leading to inconsistency between the retrieved results and the ground truth. In addition, the high cost of manually annotated datasets limits further performance improvement of CoIR. To this end, this article proposes an LLM-enhanced Intent Uncertainty-Aware Linguistic-Visual Dual Channel Matching Model (IUDC), which combines the strengths of multi-modal late fusion and LLMs for CoIR. We first construct an LLM-based triplet augmentation strategy to generate additional synthetic training triplets. Building on this, the core of IUDC consists of two matching channels: the semantic matching channel performs intent reasoning over the aspect-level attributes extracted by an LLM, and the visual matching channel performs fine-grained visual matching between the multi-modal fusion embedding and the target images. To account for the intent uncertainty present in multi-modal queries, we introduce a Probability Distribution Encoder (PDE) that projects the intents as probability distributions in the two matching channels. We then design a mutually enhanced module to share knowledge between the visual and semantic representations for better representation learning. Finally, the matching scores of the two channels are added to retrieve the target image.
Extensive experiments conducted on two real-world datasets demonstrate the effectiveness and superiority of our model. Notably, with the help of the proposed LLM-based triplet augmentation strategy, our model achieves new state-of-the-art performance across all datasets.
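
The final retrieval step described in the abstract, adding the matching scores of the two channels and ranking candidates by the sum, can be illustrated with a minimal sketch. Only the additive fusion comes from the abstract. The use of cosine similarity as the per-channel score, the toy two-dimensional embeddings, and the names `cosine` and `rank_candidates` are assumptions made for illustration; the sketch does not model the PDE or the mutually enhanced module, and it is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed scoring function)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(sem_query, vis_query, candidates):
    """Rank candidates by the sum of semantic-channel and visual-channel scores.

    candidates: list of (name, semantic_embedding, visual_embedding) tuples.
    Returns a list of (name, fused_score) sorted from best to worst.
    """
    scored = []
    for name, sem_emb, vis_emb in candidates:
        # Additive fusion of the two channel scores, as stated in the abstract.
        fused = cosine(sem_query, sem_emb) + cosine(vis_query, vis_emb)
        scored.append((name, fused))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: three candidate images with 2-D embeddings per channel.
candidates = [
    ("img_a", [1.0, 0.0], [1.0, 0.0]),  # matches the query in both channels
    ("img_b", [0.0, 1.0], [1.0, 0.0]),  # matches only in the visual channel
    ("img_c", [0.0, 1.0], [0.0, 1.0]),  # matches in neither channel
]
ranking = rank_candidates([1.0, 0.0], [1.0, 0.0], candidates)
for name, score in ranking:
    print(name, round(score, 3))
```

In this toy setup a candidate that agrees with the query in both channels outranks one that agrees in only one channel, which is the intuition behind combining the two scores.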
