Computer science
Modal
Graphics
Scene graph
Artificial intelligence
Computer vision
Human-computer interaction
Theoretical computer science
Chemistry
Polymer chemistry
Rendering (computer graphics)
Authors
Yu He,Kang Zhou,Tao Tian
Identifiers
DOI: 10.1007/s11227-024-06541-8
Abstract
Visual navigation requires an agent to locate a given target using visual perception. To enable robots to execute such tasks effectively, combining large language models (LLMs) with multi-modal inputs for navigation is necessary. While LLMs offer rich semantic knowledge, they lack grounded real-world information and real-time interaction capabilities. This paper introduces a Multi-modal Scene Graph (MMSG) navigation framework that aligns LLMs with visual perception models to predict the next step. First, a multi-modal scene dataset is constructed, containing object-relation-target-word triplets. We provide target words and lists of objects present in the scene to GPT-3.5 to generate a large number of instructions and corresponding action plans. The generated data is then used to pre-train the LLM for path planning. During inference, we discover objects in the scene by extending the DETR visual object detector to multi-view RGB images collected from different reachable positions. Experimental results show that path plans generated by MMSG outperform state-of-the-art methods, indicating its feasibility in complex environments. We evaluate our method on the ProTHOR dataset and demonstrate superior navigation performance.
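To make the inference-time pipeline described above concrete, the following is a minimal sketch, not the authors' implementation: it aggregates DETR detections over multi-view RGB images into a scene object list, then composes a GPT-3.5-style planning prompt from the target word and the detected objects. The helper names `collect_scene_objects` and `build_planning_prompt`, the Hugging Face `facebook/detr-resnet-50` checkpoint, the score threshold, and the prompt wording are all assumptions made for illustration.

```python
# Hypothetical sketch (not the paper's code): fuse DETR detections from several
# viewpoints into one scene object list, then build an LLM planning prompt.
import torch
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
detector = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")


def collect_scene_objects(views, score_threshold=0.7):
    """Run DETR on RGB views taken from different reachable positions and
    return the union of detected object labels (hypothetical helper)."""
    labels = set()
    for image in views:  # each `image` is a PIL.Image from one viewpoint
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            outputs = detector(**inputs)
        target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
        results = processor.post_process_object_detection(
            outputs, threshold=score_threshold, target_sizes=target_sizes
        )[0]
        labels.update(detector.config.id2label[i.item()] for i in results["labels"])
    return sorted(labels)


def build_planning_prompt(target_word, scene_objects):
    """Assemble a GPT-3.5-style instruction prompt from the target word and
    the objects visible in the scene (prompt wording is illustrative only)."""
    return (
        f"Target: {target_word}\n"
        f"Objects visible in the scene: {', '.join(scene_objects)}\n"
        "Produce a step-by-step navigation plan toward the target."
    )
```

In the paper's framework the instruction/plan pairs generated this way are used to pre-train the LLM planner; the prompt builder here only illustrates how a detected object list and a target word could be combined into a single planning query.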