Computer Science
Artificial Intelligence
Image (mathematics)
Computer Vision
Image Processing
Matching (statistics)
Pattern Recognition (psychology)
Mathematics
Statistics
Authors
Guoxin Xiong,Meng Meng,Tianzhu Zhang,Dongming Zhang,Yongdong Zhang
Identifier
DOI:10.1109/tcsvt.2024.3392619
Abstract
Image-text matching aims to bridge the vision and language domains and is a crucial task in multi-modal intelligence. The core idea is to learn features for each modality and aggregate the learned features into holistic representations to measure image-text relevance. Most existing methods involve cross-modal interaction during feature learning by modeling fine-grained relationships between the two modalities for better results. However, these methods may produce incorrect attention scores when directly computing similarities between regions and words. In addition, current methods mainly rely on simple pooling operations for feature aggregation, which introduces interference from redundant information and leads to inaccurate matching results. To alleviate these issues, we propose a novel reference-aware adaptive network for image-text matching that jointly uses a reference attention module for feature learning and an adaptive aggregation module for feature aggregation. The proposed model enjoys several merits. First, the designed reference attention module effectively reduces incorrect attention scores by introducing a set of references during cross-modal interaction. Second, the proposed adaptive aggregation module adaptively highlights useful information while suppressing redundant information during aggregation. Extensive experiments on two standard benchmarks demonstrate that our method performs favorably against state-of-the-art methods.
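The abstract does not give implementation details, so the following is only a minimal PyTorch sketch of the two ideas it names: routing cross-modal attention through a shared set of reference vectors instead of raw region-word similarities, and replacing simple pooling with a learned, softmax-weighted aggregation. All class names, tensor shapes, and the specific attention routing below are illustrative assumptions, not the authors' actual architecture.

```python
# Hypothetical sketch of reference-guided cross-modal attention and adaptive
# aggregation; shapes and routing are assumptions, not the paper's method.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReferenceAttention(nn.Module):
    """Cross-modal attention mediated by a small set of learnable references."""

    def __init__(self, dim: int, num_refs: int = 16):
        super().__init__()
        self.refs = nn.Parameter(torch.randn(num_refs, dim) * 0.02)
        self.scale = dim ** -0.5

    def forward(self, regions: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # regions: (B, R, D) image region features; words: (B, W, D) word features.
        # Words first attend to the shared references ...
        word_to_ref = F.softmax(words @ self.refs.t() * self.scale, dim=-1)  # (B, W, K)
        ref_context = word_to_ref @ self.refs                                # (B, W, D)
        # ... and regions then attend to the reference-filtered word context,
        # rather than to raw word features, to damp spurious similarities.
        attn = F.softmax(regions @ ref_context.transpose(1, 2) * self.scale, dim=-1)  # (B, R, W)
        return attn @ ref_context                                            # (B, R, D)


class AdaptiveAggregation(nn.Module):
    """Learned weighted aggregation in place of mean/max pooling."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, D); the gate down-weights redundant tokens before pooling.
        weights = F.softmax(self.gate(feats), dim=1)  # (B, N, 1)
        return (weights * feats).sum(dim=1)           # (B, D)


if __name__ == "__main__":
    B, R, W, D = 2, 36, 12, 256
    regions, words = torch.randn(B, R, D), torch.randn(B, W, D)
    attended = ReferenceAttention(D)(regions, words)   # (B, R, D)
    img_vec = AdaptiveAggregation(D)(attended)         # (B, D) holistic image vector
    txt_vec = AdaptiveAggregation(D)(words)            # (B, D) holistic text vector
    relevance = F.cosine_similarity(img_vec, txt_vec)  # image-text relevance score
    print(relevance.shape)
```

In this sketch, a training objective such as a triplet ranking loss over the cosine relevance scores would be added on top; the point is only to contrast reference-mediated attention and gated aggregation with direct region-word similarity plus mean pooling.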