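Computer science
Visual reasoning
Artificial intelligence
Ambiguity
Natural language
Spatial intelligence
Visualization
Scalability
Spotting
Perception
Language model
Cognitive architecture
Bridging (networking)
Natural language understanding
Suite
Human-computer interaction
Intersection (aeronautics)
Matching (statistics)
Commonsense reasoning
Natural language processing
Architecture
Visual language
Scale (ratio)
Path (computing)
Leverage (statistics)
Causal reasoning
Trajectory
Visual perception
Hallucination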
Authors
Ruijie Lu, Yiyang Ma, Xiaokang Chen, Lingxiao Luo, Zhiyu Wu, Zizheng Pan, Xingchao Liu, Yutong Lin, Hao Li, Wen Liu, Zhewen Hao, Xi Gao, Shaoheng Nie, Yixuan Wei, Zhenda Xie, Ting Chen, Gang Zeng
Source
Journal: CERN European Organization for Nuclear Research - Zenodo
Date: 2026-04-30
Identifier
DOI: 10.5281/zenodo.20125350
Abstract
Despite the remarkable progress in Multimodal Large Language Models (MLLMs), the prevailing Chain-of-Thought (CoT) paradigms remain predominantly confined to the linguistic space. While recent advancements have focused on bridging the Perception Gap through high-resolution cropping (e.g., Thinking with Images), they overlook a more fundamental bottleneck: the Reference Gap. The inherent ambiguity of natural language often fails to provide precise, unambiguous pointers to complex spatial layouts, leading to logical collapse in tasks requiring rigorous grounding. In this work, we introduce Thinking with Visual Primitives, a novel reasoning framework that elevates spatial markers, such as points and bounding boxes, to “minimal units of thought”. By interleaving these visual primitives directly into the thinking process, our model can “point” while it “reasons”, effectively grounding its cognitive trajectory in the physical coordinates of the image. Notably, our framework is built on a highly optimized architecture with extreme visual token efficiency. Despite its compact model scale and significantly lower image-token budget, our model achieves frontier-competitive performance on a focused suite of challenging visual QA tasks, matching or exceeding models such as GPT-5.4, Claude-Sonnet-4.6, and Gemini-3-Flash. This demonstrates a path toward more efficient and scalable System-2-like multimodal intelligence.
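As a rough illustration of what interleaving points and bounding boxes with text could look like in practice, the sketch below splits a reasoning trace into ordered text segments and coordinate primitives. The `<point>` and `<box>` tag syntax, the pixel-integer coordinates, and the names `Primitive` and `parse_trace` are all assumptions made for this example; the abstract does not specify how the framework serializes its primitives.

```python
import re
from dataclasses import dataclass

# Hypothetical tag syntax, assumed for illustration only: the abstract does not
# say how points and boxes are written inside the reasoning trace.
PRIMITIVE_RE = re.compile(r"<(point|box)>\s*([\d\s,]+?)\s*</\1>")


@dataclass(frozen=True)
class Primitive:
    kind: str      # "point" or "box"
    coords: tuple  # (x, y) for a point, (x1, y1, x2, y2) for a box


def parse_trace(trace):
    """Return the trace as an ordered list of text segments and Primitives."""
    parts = []
    cursor = 0
    for match in PRIMITIVE_RE.finditer(trace):
        # Keep any plain text that precedes this primitive.
        text = trace[cursor:match.start()].strip()
        if text:
            parts.append(text)
        kind = match.group(1)
        coords = tuple(int(v) for v in match.group(2).split(","))
        expected = 2 if kind == "point" else 4
        if len(coords) != expected:
            raise ValueError(f"{kind} needs {expected} values, got {len(coords)}")
        parts.append(Primitive(kind, coords))
        cursor = match.end()
    # Keep any plain text that follows the last primitive.
    tail = trace[cursor:].strip()
    if tail:
        parts.append(tail)
    return parts


if __name__ == "__main__":
    example = ("The cup <box>40, 60, 120, 180</box> sits left of "
               "the spoon <point>200, 150</point>.")
    for part in parse_trace(example):
        print(part)
```

Keeping text and primitives in one ordered list preserves where in the reasoning each coordinate reference occurs, which is the property the abstract describes as pointing while reasoning.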