Computer science
Artificial intelligence
Semantics (computer science)
Image (mathematics)
Natural language processing
Object (grammar)
Salience
Scale (ratio)
Training
Computer vision
Programming languages
Quantum mechanics
Physics
Meteorology
Authors
Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Dong Li, Furu Wei, Yejin Choi, Jianfeng Gao
Source
Journal: Cornell University - arXiv
Date: 2020-01-01
Citations: 57
Identifier
DOI:10.48550/arxiv.2004.06165
Abstract
Large-scale pre-training methods of learning cross-modal representations on image-text pairs are becoming popular for vision-language tasks. While existing methods simply concatenate image region features and text features as input to the model to be pre-trained and use self-attention to learn image-text semantic alignments in a brute force manner, in this paper, we propose a new learning method Oscar (Object-Semantics Aligned Pre-training), which uses object tags detected in images as anchor points to significantly ease the learning of alignments. Our method is motivated by the observation that the salient objects in an image can be accurately detected, and are often mentioned in the paired text. We pre-train an Oscar model on the public corpus of 6.5 million text-image pairs, and fine-tune it on downstream tasks, creating new state-of-the-arts on six well-established vision-language understanding and generation tasks.
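The input construction described in the abstract can be sketched in code: the caption tokens, the detected object tags, and the region features are concatenated into one sequence over which self-attention learns cross-modal alignments, with the tags acting as anchors between the two modalities. This is a minimal illustrative sketch, not the paper's implementation: the function names, the toy dimensions, and the weight-free single-head attention are all assumptions for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def oscar_input(word_emb, tag_emb, region_emb):
    """Concatenate the (words, tags, regions) triple into one sequence.

    In Oscar, object tags are embedded with the same text embedder as the
    caption, so a caption token naming a detected object is close to the
    tag embedding, which in turn accompanies its region feature -- the
    "anchor point" idea described in the abstract.
    """
    return np.concatenate([word_emb, tag_emb, region_emb], axis=0)

def self_attention(x):
    # Single-head scaled dot-product self-attention without learned
    # projections (a stand-in for one pre-trained transformer layer).
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ x

# Toy example: 5 caption tokens, 2 detected object tags (e.g. "dog",
# "couch"), and 2 region features projected to the same hidden size.
d = 8
words = rng.normal(size=(5, d))
tags = rng.normal(size=(2, d))
regions = rng.normal(size=(2, d))

seq = oscar_input(words, tags, regions)
out = self_attention(seq)
print(seq.shape, out.shape)  # (9, 8) (9, 8)
```

Because all three segments share one sequence, every caption token attends directly to every tag and region feature; the tags' shared text embedding is what makes the alignment easier than brute-force attention over words and regions alone.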