Computer science
Semantic similarity
Artificial intelligence
Similarity (geometry)
Similarity learning
Cosine similarity
Measure (data warehouse)
Matching (statistics)
Discriminative model
Natural language processing
Dimension (graph theory)
Similarity measure
Pattern recognition (psychology)
Information retrieval
Image (mathematics)
Data mining
Mathematics
Statistics
Pure mathematics
Authors
Kun Zhang, Bo Hu, Huatian Zhang, Zhe Li, Zhendong Mao
Identifier
DOI:10.1109/tcsvt.2023.3307554
Abstract
Image-text matching is a fundamental task to bridge vision and language. The critical challenge lies in accurately learning the semantic similarity between these two heterogeneous modalities. For visual and textual features, existing methods typically default to a static dimensional correspondence mechanism, i.e., using a single dimension as the measure-unit to perform one-to-one correspondence, to examine semantic similarity, e.g., the cosine/Euclidean distance or the weighted similarity. In this paper, different from the single-dimensional correspondence with limited semantic expressive capability, we propose a novel enhanced semantic similarity learning (ESL), which generalizes both measure-units and their correspondences into a dynamic learnable framework to examine the multi-dimensional enhanced correspondence between visual and textual features. Specifically, we first devise the intra-modal multi-dimensional aggregators with iterative enhancing mechanism, which dynamically captures new measure-units integrated by hierarchical multi-dimensions, producing diverse semantic combinatorial expressive capabilities to provide richer and discriminative information for similarity examination. Then, we devise the inter-modal enhanced correspondence learning with sparse contribution degrees, which comprehensively and efficiently determines the cross-modal semantic similarity. Extensive experiments verify its superiority in achieving state-of-the-art performance. Codes will be released.
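The contrast drawn in the abstract can be sketched in code. Below, `cosine_similarity` implements the static one-to-one dimensional correspondence the paper critiques, while `multi_dim_similarity` is a minimal, hypothetical illustration (not the authors' ESL method) of generalizing the correspondence with a learnable matrix `W`, so that groups of dimensions can contribute jointly; the identity matrix recovers the cosine baseline on unit-normalized vectors.

```python
import numpy as np

def cosine_similarity(v: np.ndarray, t: np.ndarray) -> float:
    # Static single-dimensional correspondence: dimension i of the visual
    # feature v is matched one-to-one with dimension i of the textual
    # feature t, and the per-dimension products are summed.
    return float(v @ t / (np.linalg.norm(v) * np.linalg.norm(t)))

def multi_dim_similarity(v: np.ndarray, t: np.ndarray, W: np.ndarray) -> float:
    # Hypothetical learnable correspondence: W[i, j] weights how dimension i
    # of v corresponds to dimension j of t, generalizing the identity
    # correspondence implicit in the cosine measure above. In a trained
    # model, W would be a learned parameter.
    v = v / np.linalg.norm(v)
    t = t / np.linalg.norm(t)
    return float(v @ W @ t)

rng = np.random.default_rng(0)
v, t = rng.standard_normal(8), rng.standard_normal(8)
# With W = identity, the generalized measure reduces to the cosine baseline.
assert np.isclose(multi_dim_similarity(v, t, np.eye(8)), cosine_similarity(v, t))
```

This only illustrates the "measure-unit generalization" idea at the single-pair level; the paper's actual method additionally uses intra-modal multi-dimensional aggregators and sparse cross-modal contribution degrees.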