Keywords
Computer science
Closed captioning
Discriminative
Artificial intelligence
Natural language processing
Generative grammar
Question answering
Task (project management)
Generalization
Language model
Classifier (UML)
Architecture
Set (abstract data type)
Generative model
Image (mathematics)
Programming language
Art
Mathematical analysis
Visual arts
Mathematics
Management
Economics
Authors
Jaemin Cho,Jie Lei,Hao Tan,Mohit Bansal
Source
Journal: Cornell University - arXiv
Date: 2021-01-01
Citations: 21
Identifier
DOI: 10.48550/arxiv.2102.02779
Abstract
Existing methods for vision-and-language learning typically require designing task-specific architectures and objectives for each task: for example, a multi-label answer classifier for visual question answering, a region scorer for referring expression comprehension, and a language decoder for image captioning. To alleviate these hassles, in this work we propose a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where our models learn to generate labels in text based on the visual and textual inputs. On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, and visual commonsense reasoning, most of which have previously been modeled as discriminative tasks, our generative approach (with a single unified architecture) reaches performance comparable to recent task-specific state-of-the-art vision-and-language models. Moreover, our generative approach shows better generalization ability on questions that have rare answers. We also show that our framework allows multi-task learning in a single architecture with a single set of parameters, achieving performance similar to separately optimized single-task models. Our code is publicly available at: https://github.com/j-min/VL-T5
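The abstract casts every vision-and-language task as conditional text generation with one encoder-decoder and one language-modeling objective. The sketch below illustrates that text-to-text formulation with Hugging Face's T5ForConditionalGeneration; it is not the authors' implementation. It omits the visual (region-feature) inputs that VL-T5 feeds to the encoder, and the task prefixes and the "region_3" target are illustrative assumptions rather than the paper's exact prompt format.

```python
# Minimal sketch (not the authors' code): several V&L tasks reduced to
# text-to-text generation with a single model and a single LM objective.
# Assumptions: a plain "t5-base" checkpoint, illustrative task prefixes,
# and no visual features (real VL-T5 also encodes image region features).
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Every task becomes an (input text -> target text) pair; answers and labels
# are generated as text instead of predicted by task-specific heads.
examples = [
    ("vqa: question: what color is the bus?", "red"),      # answer generated as text
    ("caption:", "a red bus parked on the street"),        # image captioning
    ("grounding: the man in a blue shirt", "region_3"),    # region id as a text label (placeholder)
]

# One training objective for all tasks: conditional text generation.
for src, tgt in examples:
    enc = tokenizer(src, return_tensors="pt")
    labels = tokenizer(tgt, return_tensors="pt").input_ids
    loss = model(input_ids=enc.input_ids,
                 attention_mask=enc.attention_mask,
                 labels=labels).loss
    loss.backward()  # the same parameters are updated for every task

# Inference: decode the label as text, whatever the task.
with torch.no_grad():
    enc = tokenizer("vqa: question: what color is the bus?", return_tensors="pt")
    out = model.generate(enc.input_ids, max_length=8)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Because all tasks share one vocabulary and one generation head, adding a task only requires new (input, target) text pairs rather than a new head or objective, which is what enables the single-architecture, single-parameter-set multi-task setup the abstract reports.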