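Computer science
Action (physics)
Cognitive science
Chain (unit)
Artificial intelligence
Natural language processing
Psychology
Physics
Astronomy
Quantum mechanics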
Authors
Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yunfeng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, Ankur Handa, Tsung-Yi Lin, Gordon Wetzstein, Mingyu Liu, Donglai Xiang
Identifier
DOI:10.1109/cvpr52734.2025.00166
Abstract
Vision-language-action models (VLAs) have shown potential in leveraging pretrained vision-language models and diverse robot demonstrations for learning generalizable sensorimotor control. While this paradigm effectively utilizes large-scale data from both robotic and non-robotic sources, current VLAs primarily focus on direct input–output mappings, lacking the intermediate reasoning steps crucial for complex manipulation tasks. As a result, existing VLAs lack temporal planning or reasoning capabilities. In this paper, we introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs) by predicting future image frames autoregressively as visual goals before generating a short action sequence to achieve these goals. We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks. Videos are available at: https://cot-vla.github.io/.
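The two-stage inference described in the abstract (first predict a future frame as a visual goal, then generate a short action sequence to reach it) can be sketched as a simple control loop. This is a minimal illustration only, not the authors' implementation: the names `VisualCoTPolicy`, `predict_subgoal`, `predict_actions`, `run_episode`, and the toy stand-in functions are all hypothetical, and images and actions are represented as plain lists of floats rather than tokens from a 7B model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[float]   # placeholder for an image observation (assumption)
Action = List[float]  # placeholder for a single robot action (assumption)


@dataclass
class VisualCoTPolicy:
    """Toy two-stage policy: visual subgoal first, then a short action chunk."""
    predict_subgoal: Callable[[Image, str], Image]           # hypothetical stage 1
    predict_actions: Callable[[Image, Image], List[Action]]  # hypothetical stage 2

    def step(self, observation: Image, instruction: str) -> List[Action]:
        # Stage 1: "think" visually by predicting a future frame as the goal.
        subgoal = self.predict_subgoal(observation, instruction)
        # Stage 2: produce a short action sequence intended to reach that goal.
        return self.predict_actions(observation, subgoal)


def run_episode(
    policy: VisualCoTPolicy,
    observation: Image,
    instruction: str,
    apply_action: Callable[[Image, Action], Image],
    num_replans: int,
) -> Tuple[Image, List[Action]]:
    """Alternate between subgoal prediction and executing the action chunk."""
    executed: List[Action] = []
    for _ in range(num_replans):
        for action in policy.step(observation, instruction):
            observation = apply_action(observation, action)
            executed.append(action)
    return observation, executed


# Toy stand-ins so the sketch runs without any model (all assumptions).
def toy_subgoal(observation: Image, instruction: str) -> Image:
    # Pretend the predicted future frame is the current one shifted by 1.0.
    return [x + 1.0 for x in observation]


def toy_actions(observation: Image, subgoal: Image) -> List[Action]:
    # Split the gap between observation and subgoal into two equal steps.
    half = [(g - o) / 2.0 for g, o in zip(subgoal, observation)]
    return [half, half]


def toy_apply(observation: Image, action: Action) -> Image:
    # Pretend executing an action simply adds it to the observation.
    return [o + a for o, a in zip(observation, action)]


if __name__ == "__main__":
    policy = VisualCoTPolicy(toy_subgoal, toy_actions)
    final_obs, executed = run_episode(
        policy, [0.0, 0.0], "pick up the cup", toy_apply, num_replans=3
    )
    print(final_obs, len(executed))
```

In this sketch the policy replans after every action chunk, so each new visual goal is conditioned on the latest observation. Whether and how often the actual model replans is not stated in the abstract.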