RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Topics: Computer Science · Natural Language · Artificial Intelligence · Robotics · Generalization · Internet · Human-Computer Interaction · Language Models · Natural Language Processing · World Wide Web · Mathematics · Mathematical Analysis
Authors
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromański, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan
Source
Venue: arXiv (Cornell University) · Cited by: 260
Identifier
DOI: 10.48550/arXiv.2307.15818
Abstract

We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. Our goal is to enable a single end-to-end trained model to both learn to map robot observations to actions and enjoy the benefits of large-scale pretraining on language and vision-language data from the web. To this end, we propose to co-fine-tune state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks, such as visual question answering. In contrast to other approaches, we propose a simple, general recipe to achieve this goal: in order to fit both natural language responses and robotic actions into the same format, we express the actions as text tokens and incorporate them directly into the training set of the model in the same way as natural language tokens. We refer to such category of models as vision-language-action models (VLA) and instantiate an example of such a model, which we call RT-2. Our extensive evaluation (6k evaluation trials) shows that our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training. This includes significantly improved generalization to novel objects, the ability to interpret commands not present in the robot training data (such as placing an object onto a particular number or icon), and the ability to perform rudimentary reasoning in response to user commands (such as picking up the smallest or largest object, or the one closest to another object). We further show that incorporating chain of thought reasoning allows RT-2 to perform multi-stage semantic reasoning, for example figuring out which object to pick up for use as an improvised hammer (a rock), or which type of drink is best suited for someone who is tired (an energy drink).
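The core of the recipe above is representing each continuous robot action as a short sequence of discrete text tokens, so that action prediction becomes ordinary next-token language modeling. A minimal sketch of such an action tokenizer is shown below; the action range [-1, 1], the choice of 256 uniform bins, and the 7-DoF action layout are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def action_to_tokens(action, low=-1.0, high=1.0, num_bins=256):
    """Discretize each continuous action dimension into one of
    `num_bins` uniform bins and emit the bin indices as a text
    string, so the action can be trained on exactly like natural
    language tokens."""
    action = np.clip(np.asarray(action, dtype=np.float64), low, high)
    bins = np.floor((action - low) / (high - low) * num_bins).astype(int)
    bins = np.clip(bins, 0, num_bins - 1)  # the value `high` maps to the top bin
    return " ".join(str(b) for b in bins)

def tokens_to_action(token_str, low=-1.0, high=1.0, num_bins=256):
    """Invert the mapping: decode bin indices back to the centers
    of their bins in the original continuous range."""
    bins = np.array([int(t) for t in token_str.split()])
    return low + (bins + 0.5) * (high - low) / num_bins

# Round-trip a hypothetical 7-DoF action (6-DoF end-effector delta + gripper):
a = [0.1, -0.5, 0.9, 0.0, -1.0, 1.0, 0.3]
tokens = action_to_tokens(a)
recovered = tokens_to_action(tokens)
```

Decoding returns bin centers, so the round-trip error is bounded by half the bin width; with 256 bins over [-1, 1] that is under 0.004 per dimension, which is why a coarse text vocabulary suffices for closed-loop control.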