Computer science
Information retrieval
Ranking (information retrieval)
Artificial intelligence
Semantics (computer science)
Image retrieval
Learning to rank
Consistency (knowledge bases)
Explicit semantic analysis
Feature (linguistics)
Visual word
Semantic computing
Image (mathematics)
Semantic Web
Semantic technology
Linguistics
Philosophy
Programming language
Authors
Qingrong Cheng, Zhenshan Tan, Keyu Wen, Cheng Chen, Xiaodong Gu
Identifier
DOI: 10.1109/tcsvt.2022.3182549
Abstract
Cross-modal retrieval aims at retrieving highly semantically relevant information across multiple modalities. Existing cross-modal retrieval methods mainly explore the semantic consistency between image and text while rarely considering the rankings of positive instances in the retrieval results. Moreover, these methods seldom take into account the cross-interaction between image and text, which limits their ability to learn semantic relations. In this paper, we propose a Unified framework with Ranking Learning (URL) for cross-modal retrieval. The unified framework consists of three sub-networks: a visual network, a textual network, and an interaction network. The visual and textual networks project image features and text features into their corresponding hidden spaces, and the interaction network then forces the target image-text representations to align in a common space. To unify semantics and rankings, we propose a new optimization paradigm that decouples semantic alignment from ranking learning: pre-alignment for semantic knowledge transfer, followed by ranking learning for the final retrieval. The former focuses on semantic pre-alignment optimized by semantic classification, while the latter revolves around the retrieval rankings. For ranking learning, we introduce a cross-AP loss that can directly optimize the retrieval metric average precision for cross-modal retrieval. We conduct experiments on four widely used benchmarks: the Wikipedia, Pascal Sentence, NUS-WIDE-10k, and PKU XMediaNet datasets. Extensive experimental results show that the proposed method obtains higher retrieval precision.
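The abstract states that the cross-AP loss directly optimizes average precision (AP), the standard ranking metric for retrieval, but does not reproduce the loss itself. As a reference point only, below is a minimal Python sketch of how AP is computed for a single query's ranked result list, assuming binary relevance labels; the function name average_precision and the toy ranking are illustrative and not taken from the paper.

def average_precision(ranked_relevance):
    """Average precision (AP) for one query.

    ranked_relevance: 0/1 flags ordered by descending retrieval score,
    where 1 marks a semantically relevant (positive) item.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this positive's rank
    return precision_sum / hits if hits else 0.0

# Example: positives retrieved at ranks 1, 3, and 4.
print(average_precision([1, 0, 1, 1, 0]))  # (1/1 + 2/3 + 3/4) / 3 ≈ 0.806

Because the discrete rank positions make plain AP non-differentiable, a training loss that targets it typically relies on some smooth relaxation; the abstract does not specify which relaxation the cross-AP loss uses.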