Computer science
Hash function
Construct (Python library)
Discriminant
Artificial intelligence
Feature extraction
Feature (linguistics)
Focus (optics)
Image retrieval
Pattern recognition (psychology)
Machine learning
Image (mathematics)
Computer security
Philosophy
Physics
Programming language
Optics
Linguistics
Authors
Hui Yu, Shuyan Ding, Lunbo Li, Jing Wu
Identifiers
DOI:10.1145/3551626.3564945
Abstract
With the explosive growth of multi-modal data such as video, images, and text on the Internet, cross-modal retrieval has received extensive attention, especially deep hashing methods. Compared with real-valued methods, deep hashing has shown promising prospects due to its low memory consumption and high search efficiency. However, most existing studies have difficulty effectively utilizing raw image-text pairs to generate discriminative feature representations. Moreover, these methods ignore the latent relationships between different modalities and fail to construct a robust similarity matrix, resulting in suboptimal retrieval performance. In this paper, we focus on unsupervised cross-modal hashing tasks and propose a Self Attentive CLIP Hashing (SACH) model. Specifically, we construct the feature extraction network by employing the pre-trained CLIP model, which has shown excellent performance on zero-shot tasks. Besides, to fully exploit the semantic relationships, an attention module is introduced to reduce the disturbance of redundant information and focus on important information. On this basis, we construct a semantic fusion similarity matrix that is capable of preserving the original semantic relationships across different modalities. Extensive experiments show the superiority of SACH compared with recent state-of-the-art unsupervised hashing methods.
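The two core ideas in the abstract — fusing per-modality similarity matrices into one semantic similarity matrix, and binarizing continuous features into hash codes — can be sketched as below. This is a minimal illustration, not the paper's actual method: the random feature matrices stand in for pre-trained CLIP image/text embeddings, and the fusion weight `alpha`, code length, and random projection are hypothetical choices for demonstration only.

```python
import numpy as np

def cosine_similarity(X):
    """Pairwise cosine similarity of row vectors in X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

def fused_similarity(img_feats, txt_feats, alpha=0.6):
    """Weighted fusion of image- and text-modality similarity matrices.

    `alpha` is an illustrative fusion weight, not the paper's setting.
    """
    return alpha * cosine_similarity(img_feats) \
        + (1 - alpha) * cosine_similarity(txt_feats)

def to_hash_codes(feats, code_len=64, seed=0):
    """Project features to `code_len` dims and binarize with sign().

    A random projection stands in for the learned hashing network.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((feats.shape[1], code_len))
    return np.sign(feats @ W)

# Stand-ins for CLIP embeddings of 8 image-text pairs (512-dim).
rng = np.random.default_rng(42)
img = rng.standard_normal((8, 512))
txt = rng.standard_normal((8, 512))

S = fused_similarity(img, txt)   # fused (8, 8) similarity matrix
codes = to_hash_codes(img)       # binary codes in {-1, +1}
```

Binary codes make retrieval cheap: similarity between items reduces to a Hamming distance, which is why deep hashing offers the low memory consumption and fast search the abstract highlights.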