作者
Muhayy Ud Din,Waseem Akram,Lyes Saad Saoud,Jan Rosell,Irfan Hussain
摘要
• Provides a unified taxonomy that organizes more than 100 VLA architectures. • Maps 26 major VLA datasets using a framework based on task difficulty and modality richness. • Presents a large-scale quantitative analysis linking model design choices to normalized performance. • Demonstrates that diffusion-based decoders and hierarchical fusion significantly improve manipulation success. • Introduces the VLA-FEB benchmark with new metrics for measuring multimodal fusion quality and alignment. • Proposes an agentic VLA framework where LLM planners verify and re-plan actions using uncertainty-driven feedback for self-improving robotic autonomy. Vision Language Action (VLA) models represent a new frontier in robotics by unifying perception, reasoning, and control within a single multimodal learning framework. By integrating visual, linguistic, and action modalities, they enable multimodal fusion systems designed for instruction-driven manipulation and generalist autonomy. This systematic review synthesizes the state of the art in VLA research with an emphasis on architectures, algorithms, and applications relevant to robotic manipulation. We examine 102 models, 26 foundational datasets, and 12 simulation platforms, categorizing them according to their fusion strategies and integration mechanisms. Foundational datasets are evaluated using a novel criterion based on task complexity, modality richness, and dataset scale, allowing a comparative analysis of their suitability for generalist policy learning. We further introduce a structured taxonomy of fusion hierarchies and encoder-decoder families, together with a two-dimensional dataset characterization framework and a meta-analytic benchmarking protocol that quantitatively links design variables to empirical performance across benchmarks. Our analysis shows that hierarchical and late fusion architectures achieve the highest manipulation success and generalization, confirming the benefit of multi-level cross-modal integration. Diffusion-based decoders demonstrate superior cross-domain transfer and robustness compared to autoregressive heads. Dataset analysis highlights a persistent lack of benchmarks that combine high-complexity, multimodal, and long-horizon tasks, while existing simulators offer limited multimodal synchronization and real-to-sim consistency. To address these gaps, we propose the VLA Fusion Evaluation Benchmark to quantify fusion efficiency and alignment. Drawing on both academic and industrial advances, the review outlines future research directions in adaptive and modular fusion architectures, computational resource optimization, and the deployment of interpretable, resource-efficient robotic systems. We further propose a forward-looking agentic VLA paradigm where LLM planners integrate VLA skills as verifiable tools within a closed feedback loop for adaptive and self-improving robotic control. This work provides both a conceptual foundation and a quantitative roadmap for advancing embodied intelligence through multimodal information fusion across robotic domains. A public repository summarizing models, datasets, and simulators is available at: https://muhayyuddin.github.io/VLAs/ .