Global and Local Semantic Completion Learning for Vision-Language Pre-Training

计算机科学人工智能自然语言处理培训（气象学）计算机视觉机器学习物理气象学

作者

Rong-Cheng Tu,Yatai Ji,Jie Jiang,Weijie Kong,Chengfei Cai,Wenzhe Zhao,Hongfa Wang,Yujiu Yang,Wei Liu

出处

期刊：IEEE Transactions on Pattern Analysis and Machine Intelligence [IEEE Computer Society]
日期：2025-08-06 卷期号：47 (12): 11065-11079 被引量：1

链接

nih.govdoi.org

标识

DOI：10.1109/tpami.2025.3596394

摘要

Cross-modal alignment plays a crucial role in vision-language pre-training (VLP) models, enabling them to capture meaningful associations across different modalities. For this purpose, inspired by the success of masked language modeling (MLM) tasks in the NLP pre-training area, numerous masked modeling tasks have been proposed for VLP to further promote cross-modal interactions. The core idea of previous masked modeling tasks is to focus on reconstructing the masked tokens based on visible context for learning local-local alignment, i.e., associations between image patches and text tokens. However, most of them pay little attention to the global semantic features generated for the masked data, resulting in a limited cross-modal alignment ability of global representations to local features of the other modality. Therefore, in this paper, we propose a novel Global and Local Semantic Completion Learning (GLSCL) task to facilitate global-local alignment and local-local alignment simultaneously. Specifically, the GLSCL task complements the missing semantics of masked data and recovers global and local features by cross-modal interactions. Our GLSCL consists of masked global semantic completion (MGSC) and masked local token completion (MLTC). MGSC promotes learning more representative global features, which have a great impact on the performance of downstream tasks, while MLTC reconstructs modal-fusion local tokens, further enhancing accurate comprehension of multimodal data. To evaluate the proposed approaches on cross-modal alignment, we develop a validation benchmark called ALIGN-BENCH. Moreover, we present a flexible vision encoder, enabling our model to simultaneously perform image-text and video-text multimodal tasks. Experimental results show that our proposed method obtains state-of-the-art performance on various vision-language benchmarks, such as visual question answering, image-text retrieval, and video-text retrieval.

求助该文献

最长约 10秒，即可获得该文献文件

Global and Local Semantic Completion Learning for Vision-Language Pre-Training

今日热心研友