Computer science
Closed captioning
Dialog box
Question answering
Set (abstract data type)
Natural language processing
Language model
Coding (set theory)
Reading (process)
Pipeline (software)
Artificial intelligence
Image (mathematics)
Programming language
World Wide Web
Linguistics
Philosophy
Authors
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou
Source
Venue: Cornell University - arXiv
Date: 2023-01-01
Citations: 72
Identifiers
DOI: 10.48550/arxiv.2308.12966
Abstract
In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity by the meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond the conventional image description and question-answering, we implement the grounding and text-reading ability of Qwen-VLs by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models under similar model scales on a broad range of visual-centric benchmarks (e.g., image captioning, question answering, visual grounding) and different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our instruction-tuned Qwen-VL-Chat also demonstrates superiority compared to existing vision-language chatbots. Code, demo and models are available at https://github.com/QwenLM/Qwen-VL.
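The abstract notes that grounding and text reading are trained by aligning image-caption-box tuples. A minimal sketch of what such a tuple might look like as model input, assuming the paper's convention of normalizing box coordinates to the [0, 1000) range and wrapping them in `<box>`/`<ref>` tags (the helper name `to_qwen_box` is hypothetical):

```python
def to_qwen_box(x1, y1, x2, y2, width, height):
    """Normalize pixel corner coordinates to the [0, 1000) range
    (an assumption based on the paper's grounding format) and render
    them as a <box>...</box> string."""
    nx1 = round(x1 / width * 1000)
    ny1 = round(y1 / height * 1000)
    nx2 = round(x2 / width * 1000)
    ny2 = round(y2 / height * 1000)
    return f"<box>({nx1},{ny1}),({nx2},{ny2})</box>"

# A grounded caption pairs a referring expression with its box:
# for a 400x400 image, a box at pixels (50,100)-(200,300) becomes
caption = f"<ref>a dog</ref>{to_qwen_box(50, 100, 200, 300, 400, 400)}"
print(caption)  # <ref>a dog</ref><box>(125,250),(500,750)</box>
```

The normalization makes box strings resolution-independent, so the same token sequence describes the same relative region regardless of the input image size.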