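Mode
Computer science
Modal verb
Variety (cybernetics)
Artificial intelligence
Vision science
Natural language processing
Social science
Sociology
Chemistry
Polymer chemistry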
Authors
Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, Douwe Kiela
Identifier
DOI:10.1109/cvpr52688.2022.01519
Abstract
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal (with earlier fusion) but not both; and they often only target specific modalities or tasks. A promising direction would be to use a single holistic universal model, as a “foundation”, that targets all modalities at once: a true vision and language foundation model should be good at vision tasks, language tasks, and cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate impressive performance on a wide range of 35 tasks spanning these target modalities.