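Computer science
Image retrieval
Artificial intelligence
Computer vision
Image matching
Image (mathematics)
Matching (statistics)
Pattern recognition (psychology)
Information retrieval
Mathematics
Statistics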
Authors
Hengchang Wang,Li Liu,Huaxiang Zhang,Lei Zhu,Xiaojun Chang,Hao Du
Identifier
DOI:10.1109/tcsvt.2025.3597097
Abstract
Image-text matching, a fundamental cross-modal understanding task, presents unique challenges in weakly aligned scenarios. Such data typically feature highly abstract captions with sparse entity references, creating a significant semantic gap between the text and the visual content. Current mainstream methods, designed primarily for strongly aligned data pairs, employ dynamic modeling or multi-dimensional similarity computation to achieve feature space mapping. However, they struggle with the information asymmetry and modal heterogeneity of weakly aligned cases. To address this, we propose a Visual Perception Knowledge Enhancement (VPKE) framework. Unlike existing methods built on strong alignment assumptions, the framework uses vision-language models to mine latent image semantics and generate auxiliary captions, overcoming the information bottleneck of traditional text modalities. Its core innovation is an adaptive knowledge distillation mechanism that combines retrieval-augmented generation (RAG) with key entity extraction. This mechanism filters noise when external knowledge is introduced while optimizing cross-modal feature integration. The framework employs multi-level similarity evaluation to dynamically adjust the fusion weights among the original text, key entities, and auxiliary captions, enabling adaptive integration of diverse semantic features and improving model flexibility. In addition, multi-scale feature extraction further enhances cross-modal representation. Experimental results show that the proposed method achieves strong performance on image-text retrieval on the MSCOCO and Flickr30K datasets, validating its effectiveness.
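The abstract does not specify how the fusion weights are computed, so the following is only a minimal sketch of one plausible reading of similarity-driven fusion, not the VPKE implementation. It assumes each text-side source (original text, key entities, auxiliary caption) is already embedded as a vector, scores each against the image embedding with cosine similarity, and turns the scores into weights with a temperature-scaled softmax. The function names, the `temperature` parameter, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def softmax(scores, temperature=1.0):
    """Numerically stable softmax; a lower temperature sharpens the weights."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_text_features(image_vec, text_vecs, temperature=0.1):
    """Weight each text-side vector by its similarity to the image, then sum.

    Assumption: this softmax-over-cosine rule stands in for the paper's
    multi-level similarity evaluation, which the abstract does not detail.
    """
    sims = [cosine(image_vec, t) for t in text_vecs]
    weights = softmax(sims, temperature)
    dim = len(image_vec)
    fused = [sum(w * t[i] for w, t in zip(weights, text_vecs)) for i in range(dim)]
    return weights, fused


# Toy embeddings (hypothetical): an abstract caption that matches the image
# poorly, plus key entities and an auxiliary caption that match it well.
image = [1.0, 0.0, 0.0]
original_text = [0.2, 0.9, 0.1]
key_entities = [0.8, 0.1, 0.1]
aux_caption = [0.9, 0.3, 0.0]

weights, fused = fuse_text_features(image, [original_text, key_entities, aux_caption])
```

In this toy setup the weakly aligned original caption receives the smallest weight, so the fused representation leans on the key entities and the auxiliary caption. That is the behavior the abstract describes at a high level.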