Computer science
Bridging (networking)
Pascal (unit)
Artificial intelligence
Matching (statistics)
Semantics (computer science)
Benchmark (surveying)
Class (philosophy)
Margin (machine learning)
Pattern recognition (psychology)
Image (mathematics)
Visualization
Semantic gap
Natural language processing
Machine learning
Image retrieval
Mathematics
Statistics
Computer network
Programming language
Geodesy
Geography
Authors
Leilei Ma,Hongxing Xie,Lei Wang,Yanping Fu,Dengdi Sun,Haifeng Zhao
Identifier
DOI:10.1145/3664647.3680815
Abstract
Recently, large-scale vision-language pre-trained (VLP) models have demonstrated impressive performance across various downstream tasks. Motivated by these advancements, pioneering efforts have emerged in multi-label image recognition with missing labels, leveraging VLP prompt-tuning technology. However, they usually cannot match text and vision features well, due to complicated semantic gaps and missing labels in a multi-label image. To tackle this challenge, we propose $\textbf{T}$ext-$\textbf{R}$egion $\textbf{M}$atching for optimizing $\textbf{M}$ulti-$\textbf{L}$abel prompt tuning, namely TRM-ML, a novel method for enhancing meaningful cross-modal matching. Compared to existing methods, we advocate exploring the information of category-aware regions rather than the entire image or pixels, which contributes to bridging the semantic gap between textual and visual representations in a one-to-one matching manner. Concurrently, we further introduce multimodal contrastive learning to narrow the semantic gap between textual and visual modalities and establish intra-class and inter-class relationships. Additionally, to deal with missing labels, we propose a multimodal category prototype that leverages intra- and inter-category semantic relationships to estimate unknown labels, facilitating pseudo-label generation. Extensive experiments on the MS-COCO, PASCAL VOC, Visual Genome, NUS-WIDE, and CUB-200-2011 benchmark datasets demonstrate that our proposed framework outperforms the state-of-the-art methods by a significant margin. Our code is available here: https://github.com/yu-gi-oh-leilei/TRM-ML.
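The one-to-one text-region matching idea described above can be illustrated with a minimal sketch: each class's text (prompt) embedding is compared against a set of candidate region embeddings, and the best-matching region supplies that class's score. This is a hedged toy illustration, not the paper's implementation; the function `text_region_match`, the max-over-regions aggregation, and the temperature value are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Normalize embeddings so dot products equal cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def text_region_match(text_emb, region_emb, temperature=0.07):
    """Toy one-to-one matching: for each class text embedding, pick the
    most similar image region and use that similarity as the class logit.

    text_emb:   (C, D) one prompt embedding per category
    region_emb: (R, D) one embedding per candidate image region
    returns:    (C,)   temperature-scaled per-class matching scores
    """
    t = l2_normalize(text_emb)
    r = l2_normalize(region_emb)
    sim = t @ r.T                 # (C, R) cosine similarity matrix
    best = sim.max(axis=1)        # each category attends to its best region
    return best / temperature     # scaled logits for multi-label prediction

# Toy example: 3 categories, 4 regions, 8-dim embeddings.
rng = np.random.default_rng(0)
text = rng.normal(size=(3, 8))
regions = rng.normal(size=(4, 8))
logits = text_region_match(text, regions)
print(logits.shape)  # (3,)
```

Matching against category-aware regions rather than the whole image, as the abstract argues, keeps each text embedding from being compared with background clutter; the max over regions above is one simple way to realize that selectivity.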