Multimodal fusion with vision-language-action models for robotic manipulation: A systematic review

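Computer science, Artificial intelligence, Machine learning, Benchmarking, Robustness (evolution), Benchmark (surveying), Modular design, Robotics, Fusion mechanism, Task (project management), Software deployment, Sensor fusion, Resource (disambiguation), Human-computer interaction, Fusion, Protocol (science), On the fly, Computational model, Correctness, Swarm robotics, Taxonomy (biology), Modality (human-computer interaction)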

Authors

Muhayy Ud Din, Waseem Akram, Lyes Saad Saoud, Jan Rosell, Irfan Hussain

Source

Journal: Information Fusion [Elsevier BV]
Date: 2025-12-16 Volume: 129, Article: 104062

Identifier

DOI: 10.1016/j.inffus.2025.104062

Abstract

• Provides a unified taxonomy that organizes more than 100 VLA architectures.
• Maps 26 major VLA datasets using a framework based on task difficulty and modality richness.
• Presents a large-scale quantitative analysis linking model design choices to normalized performance.
• Demonstrates that diffusion-based decoders and hierarchical fusion significantly improve manipulation success.
• Introduces the VLA-FEB benchmark with new metrics for measuring multimodal fusion quality and alignment.
• Proposes an agentic VLA framework where LLM planners verify and re-plan actions using uncertainty-driven feedback for self-improving robotic autonomy.

Vision Language Action (VLA) models represent a new frontier in robotics by unifying perception, reasoning, and control within a single multimodal learning framework. By integrating visual, linguistic, and action modalities, they enable multimodal fusion systems designed for instruction-driven manipulation and generalist autonomy. This systematic review synthesizes the state of the art in VLA research with an emphasis on architectures, algorithms, and applications relevant to robotic manipulation. We examine 102 models, 26 foundational datasets, and 12 simulation platforms, categorizing them according to their fusion strategies and integration mechanisms. Foundational datasets are evaluated using a novel criterion based on task complexity, modality richness, and dataset scale, allowing a comparative analysis of their suitability for generalist policy learning. We further introduce a structured taxonomy of fusion hierarchies and encoder-decoder families, together with a two-dimensional dataset characterization framework and a meta-analytic benchmarking protocol that quantitatively links design variables to empirical performance across benchmarks.
Our analysis shows that hierarchical and late fusion architectures achieve the highest manipulation success and generalization, confirming the benefit of multi-level cross-modal integration. Diffusion-based decoders demonstrate superior cross-domain transfer and robustness compared to autoregressive heads. Dataset analysis highlights a persistent lack of benchmarks that combine high-complexity, multimodal, and long-horizon tasks, while existing simulators offer limited multimodal synchronization and real-to-sim consistency. To address these gaps, we propose the VLA Fusion Evaluation Benchmark to quantify fusion efficiency and alignment. Drawing on both academic and industrial advances, the review outlines future research directions in adaptive and modular fusion architectures, computational resource optimization, and the deployment of interpretable, resource-efficient robotic systems. We further propose a forward-looking agentic VLA paradigm where LLM planners integrate VLA skills as verifiable tools within a closed feedback loop for adaptive and self-improving robotic control. This work provides both a conceptual foundation and a quantitative roadmap for advancing embodied intelligence through multimodal information fusion across robotic domains. A public repository summarizing models, datasets, and simulators is available at: https://muhayyuddin.github.io/VLAs/
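
The closed feedback loop described in the abstract, in which a planner calls skills as tools and re-plans when a step looks unreliable, can be illustrated with a minimal Python sketch. This is not the authors' implementation: `SkillResult`, `run_agentic_loop`, the `execute` and `replan` callbacks, the scalar `uncertainty` signal, and the `threshold` and `max_replans` parameters are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SkillResult:
    success: bool       # did the skill report completion?
    uncertainty: float  # hypothetical confidence signal in [0, 1]; higher is worse


def run_agentic_loop(
    plan: List[str],
    execute: Callable[[str], SkillResult],
    replan: Callable[[str, List[str]], List[str]],
    threshold: float = 0.5,
    max_replans: int = 3,
) -> Tuple[bool, List[Tuple[str, SkillResult]]]:
    """Run skills in order; re-plan when a step fails or is too uncertain.

    `execute` stands in for a VLA skill call and `replan` for an LLM planner;
    both are assumptions of this sketch, not APIs from the reviewed paper.
    """
    log: List[Tuple[str, SkillResult]] = []
    queue = list(plan)
    replans = 0
    while queue:
        skill = queue.pop(0)
        result = execute(skill)
        log.append((skill, result))
        # Verification step: accept only confident successes.
        if result.success and result.uncertainty <= threshold:
            continue
        # Give up once the re-planning budget is exhausted.
        if replans >= max_replans:
            return False, log
        replans += 1
        # The planner proposes a new remaining plan given the rejected step.
        queue = replan(skill, queue)
    return True, log
```

With a planner that simply retries the rejected skill, a plan such as `["reach", "grasp", "lift"]` in which the first `grasp` attempt is too uncertain would run `grasp` twice before continuing to `lift`. A real system would presumably pass richer feedback (observations, failure modes) to the planner than a single scalar.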
