Computer science, Closed captioning, Language model, Artificial intelligence, Context (archaeology), Natural language processing, Question answering, Generative model, Task (project management), Natural language, Inference, Perplexity, Machine learning, Generative grammar, Image (mathematics), Paleontology, Management, Economics, Biology
Authors
Inhwan Bae, Junoh Lee, Hae-Gon Jeon
Identifier
DOI:10.1109/tpami.2025.3582000
Abstract
Recent advancements in language models have demonstrated their capacity for context understanding and generative representation. Leveraging these developments, we propose a novel multimodal trajectory predictor based on a vision-language model, named VLMTraj, which takes full advantage of the prior knowledge of multimodal large language models and their human-like reasoning across diverse modality information. The key idea of our model is to reframe the trajectory prediction task as a visual question answering format, using historical information as context and instructing the language model to make predictions in a conversational manner. Specifically, we transform all inputs into a natural language style: historical trajectories are converted into text prompts, and scene images are described through image captioning. Additionally, visual features from input images are transformed into tokens via a modality encoder and connector. The transformed data is then formatted for use in a language model. Next, to guide the language model in understanding and reasoning about high-level knowledge, such as scene context and social relationships between pedestrians, we introduce auxiliary multi-task questions and answers. For training, we first optimize a numerical tokenizer on the prompt data to effectively separate integer and decimal parts, allowing us to capture correlations between consecutive numbers in the language model. We then train our language model on all the visual question answering prompts. During inference, we implement both deterministic and stochastic prediction through beam-search-based most-likely prediction and temperature-based multimodal generation. VLMTraj validates that a language-based model can be a powerful pedestrian trajectory predictor, outperforming existing numerical-based prediction methods. Extensive experiments show that VLMTraj successfully understands social relationships and accurately extrapolates multimodal futures on public pedestrian trajectory prediction benchmarks.
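To make the prompt-conversion step concrete, below is a minimal sketch of how a historical trajectory and a scene caption might be rendered as a visual-question-answering prompt, with integer and decimal parts of each coordinate separated for a numerical tokenizer. All names (`split_number`, `trajectory_to_prompt`) and the template wording are illustrative assumptions, not the paper's actual scheme.

```python
# Hypothetical sketch: trajectory prediction reframed as a VQA-style prompt.
# The paper's real prompt template and tokenizer design are not given in
# the abstract, so everything here is an assumed, simplified stand-in.

def split_number(value: float, decimals: int = 2) -> str:
    """Separate the integer and decimal parts with spaces so a numerical
    tokenizer can assign each part its own token (assumed scheme)."""
    sign = "-" if value < 0 else ""
    integer, _, fraction = f"{abs(value):.{decimals}f}".partition(".")
    return f"{sign}{integer} . {fraction}"

def trajectory_to_prompt(history, caption: str, horizon: int = 12) -> str:
    """Render observed (x, y) positions and a scene caption as a
    conversational question for the language model."""
    steps = "; ".join(
        f"t={t}: ({split_number(x)}, {split_number(y)})"
        for t, (x, y) in enumerate(history)
    )
    return (
        f"Scene: {caption}\n"
        f"Observed pedestrian positions: {steps}.\n"
        f"Question: predict the next {horizon} positions."
    )

history = [(2.31, 4.05), (2.58, 4.22), (2.84, 4.41)]
print(trajectory_to_prompt(history, "a crowded plaza with two pedestrians"))
```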
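The abstract's two inference modes, beam-search-based most-likely prediction and temperature-based multimodal generation, can be sketched with a generic Hugging Face-style causal language model, as below. The checkpoint, decoding parameters, and prompt are placeholders; VLMTraj's actual configuration is not specified in this abstract.

```python
# Assumed sketch of deterministic vs. stochastic decoding, using a
# placeholder GPT-2 checkpoint in place of the paper's vision-language model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder checkpoint
inputs = tok("Observed pedestrian positions: ...", return_tensors="pt")

# Deterministic: beam search keeps the single most-likely continuation.
most_likely = model.generate(
    **inputs,
    num_beams=5,
    max_new_tokens=64,
    pad_token_id=tok.eos_token_id,
)

# Stochastic: temperature sampling draws several diverse future trajectories.
futures = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.9,
    num_return_sequences=20,
    max_new_tokens=64,
    pad_token_id=tok.eos_token_id,
)
print(tok.batch_decode(most_likely, skip_special_tokens=True)[0])
```

Sampling with `num_return_sequences` yields one candidate future per sequence, which matches the multimodal (many plausible futures) framing; beam search instead collapses to the single highest-likelihood prediction.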