Computer science
Margin (machine learning)
Coding (set theory)
Artificial intelligence
Source code
Ranging
Language model
Machine learning
Pattern recognition (psychology)
Natural language processing
Programming language
Set (abstract data type)
Telecommunications
Authors
Lin Xiao,Xiaoshan Yang,Fang Peng,Ming Yan,Yaowei Wang,Changsheng Xu
Identifier
DOI:10.1109/tmm.2023.3321501
Abstract
Visual Grounding (VG) is a crucial topic in the field of vision and language, which involves locating a specific region described by an expression within an image. To reduce the reliance on manually labeled data, unsupervised visual grounding methods have been developed to locate regions using pseudo-labels. However, the performance of existing unsupervised methods is highly dependent on the quality of the pseudo-labels, and these methods always encounter issues with limited diversity. In order to utilize vision and language pre-trained models to address the grounding problem, and to reasonably take advantage of pseudo-labels, we propose CLIP-VG, a novel method that can conduct self-paced curriculum adapting of CLIP with pseudo-language labels. We propose a simple yet efficient end-to-end network architecture to realize the transfer of CLIP to visual grounding. Based on the CLIP-based architecture, we further propose single-source and multi-source curriculum adapting algorithms, which can progressively find more reliable pseudo-labels to learn an optimal model, thereby achieving a balance between reliability and diversity for the pseudo-language labels. Our method outperforms the current state-of-the-art unsupervised method by a significant margin on the RefCOCO/+/g datasets in both single-source and multi-source scenarios, with improvements ranging from 6.78% to 10.67% and from 11.39% to 14.87%, respectively. The results even outperform existing weakly supervised methods. Furthermore, our method is also competitive in the fully supervised setting. The code and models are available at https://github.com/linhuixiao/CLIP-VG.
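The self-paced curriculum adapting idea described in the abstract can be sketched roughly as follows: in each round, keep only the pseudo-labeled samples that the current model scores as sufficiently reliable, adapt the model on that subset, then relax the threshold so that later rounds admit more diverse (but noisier) samples. The Python sketch below is only an illustration under assumed names (score_reliability, adapt_one_epoch, thresholds); it is not the authors' implementation, which is available in the linked repository.

# Minimal sketch of self-paced curriculum adapting with pseudo-language labels.
# Hypothetical names; not the authors' code (see https://github.com/linhuixiao/CLIP-VG).
from typing import Callable, List, Sequence, Tuple

# A pseudo-labeled sample: (image, pseudo expression, pseudo box as x1/y1/x2/y2).
Sample = Tuple[object, str, Tuple[float, float, float, float]]

def self_paced_adapt(
    model,
    pseudo_samples: Sequence[Sample],
    score_reliability: Callable[[object, Sample], float],  # assumed: current model's confidence for a sample
    adapt_one_epoch: Callable[[object, List[Sample]], object],  # assumed: fine-tunes and returns the model
    thresholds: Sequence[float] = (0.9, 0.7, 0.5),  # strict -> relaxed curriculum (illustrative values)
):
    """Progressively adapt `model`, trading pseudo-label reliability for diversity."""
    for tau in thresholds:
        # Keep only samples the *current* model deems reliable at this stage.
        selected = [s for s in pseudo_samples if score_reliability(model, s) >= tau]
        if not selected:
            continue
        # Adapt on the reliable subset; later, lower thresholds admit more samples.
        model = adapt_one_epoch(model, selected)
    return model

In this reading, the single-source and multi-source variants differ mainly in whether the pseudo-labels come from one source or are pooled and scheduled across several sources; the loop above only illustrates the general reliability-then-diversity progression.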