Computer science, Closed captioning, Language model, Artificial intelligence, Context (archaeology), Natural language processing, Question answering, Generative model, Task (project management), Natural language, Inference, Perplexity, Machine learning, Generative grammar, Image (mathematics), Paleontology, Management, Economics, Biology
Authors
Inhwan Bae, Junoh Lee, Hae-Gon Jeon
Identifier
DOI:10.1109/tpami.2025.3582000
Abstract
Recent advancements in language models have demonstrated their capacity for context understanding and generative representation. Leveraging these developments, we propose a novel multimodal trajectory predictor based on a vision-language model, named VLMTraj, which takes full advantage of the prior knowledge of multimodal large language models and their human-like reasoning across diverse modality information. The key idea of our model is to reframe the trajectory prediction task as a visual question answering format, using historical information as context and instructing the language model to make predictions in a conversational manner. Specifically, we transform all inputs into a natural language style: historical trajectories are converted into text prompts, and scene images are described through image captioning. Additionally, visual features from input images are transformed into tokens via a modality encoder and connector. The transformed data is then formatted for use in a language model. Next, to guide the language model in understanding and reasoning about high-level knowledge, such as scene context and social relationships between pedestrians, we introduce auxiliary multi-task questions and answers. For training, we first optimize a numerical tokenizer on the prompt data to effectively separate integer and decimal parts, allowing us to capture correlations between consecutive numbers in the language model. We then train our language model on all the visual question answering prompts. During inference, we implement both deterministic and stochastic prediction through beam-search-based most-likely prediction and temperature-based multimodal generation. VLMTraj validates that a language-based model can be a powerful pedestrian trajectory predictor, outperforming existing numerical-based prediction methods. Extensive experiments show that VLMTraj successfully understands social relationships and accurately extrapolates multimodal futures on public pedestrian trajectory prediction benchmarks.
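To make the prompt-conversion step concrete, below is a minimal sketch of how a historical trajectory and a scene caption might be rendered as a visual-question-answering prompt, with integer and decimal parts of each coordinate separated for a numerical tokenizer. All names (`split_number`, `trajectory_to_prompt`) and the template wording are illustrative assumptions, not the paper's actual scheme.

```python
# Hypothetical sketch: trajectory prediction reframed as a VQA-style prompt.
# The paper's real prompt template and tokenizer design are not given in
# the abstract, so everything here is an assumed, simplified stand-in.

def split_number(value: float, decimals: int = 2) -> str:
    """Separate the integer and decimal parts with spaces so a numerical
    tokenizer can assign each part its own token (assumed scheme)."""
    sign = "-" if value < 0 else ""
    integer, _, fraction = f"{abs(value):.{decimals}f}".partition(".")
    return f"{sign}{integer} . {fraction}"

def trajectory_to_prompt(history, caption: str, horizon: int = 12) -> str:
    """Render observed (x, y) positions and a scene caption as a
    conversational question for the language model."""
    steps = "; ".join(
        f"t={t}: ({split_number(x)}, {split_number(y)})"
        for t, (x, y) in enumerate(history)
    )
    return (
        f"Scene: {caption}\n"
        f"Observed pedestrian positions: {steps}.\n"
        f"Question: predict the next {horizon} positions."
    )

history = [(2.31, 4.05), (2.58, 4.22), (2.84, 4.41)]
print(trajectory_to_prompt(history, "a crowded plaza with two pedestrians"))
```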
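The abstract's two inference modes, beam-search-based most-likely prediction and temperature-based multimodal generation, can be sketched with a generic Hugging Face-style causal language model, as below. The checkpoint, decoding parameters, and prompt are placeholders; VLMTraj's actual configuration is not specified in this abstract.

```python
# Assumed sketch of deterministic vs. stochastic decoding, using a
# placeholder GPT-2 checkpoint in place of the paper's vision-language model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder checkpoint
inputs = tok("Observed pedestrian positions: ...", return_tensors="pt")

# Deterministic: beam search keeps the single most-likely continuation.
most_likely = model.generate(
    **inputs,
    num_beams=5,
    max_new_tokens=64,
    pad_token_id=tok.eos_token_id,
)

# Stochastic: temperature sampling draws several diverse future trajectories.
futures = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.9,
    num_return_sequences=20,
    max_new_tokens=64,
    pad_token_id=tok.eos_token_id,
)
print(tok.batch_decode(most_likely, skip_special_tokens=True)[0])
```

Sampling with `num_return_sequences` yields one candidate future per sequence, which matches the multimodal (many plausible futures) framing; beam search instead collapses to the single highest-likelihood prediction.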