Keywords
Computer science, Scalability, Artificial intelligence, Visual space, Wire (conductor), Convolutional neural network, Receptive field, Computational complexity theory, Coding (set theory), Computer vision, Algorithm, Theoretical computer science, Pattern recognition (psychology), Perception, Geodesy, Set (abstract data type), Database, Neuroscience, Biology, Programming language, Geography
Authors
Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Yunfan Liu
Source
Journal: Cornell University - arXiv
Date: 2024-01-01
Citations: 18
Identifier
DOI: 10.48550/arxiv.2401.10166
Abstract
Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) stand as the two most popular foundation models for visual representation learning. While CNNs exhibit remarkable scalability with linear complexity with respect to image resolution, ViTs surpass them in fitting capability despite contending with quadratic complexity. A closer inspection reveals that ViTs achieve superior visual modeling performance through the incorporation of global receptive fields and dynamic weights. This observation motivates us to propose a novel architecture that inherits these components while enhancing computational efficiency. To this end, we draw inspiration from the recently introduced state space model and propose the Visual State Space Model (VMamba), which achieves linear complexity without sacrificing global receptive fields. To address the direction-sensitive issue we encountered, we introduce the Cross-Scan Module (CSM) to traverse the spatial domain and convert any non-causal visual image into ordered patch sequences. Extensive experimental results substantiate that VMamba not only demonstrates promising capabilities across various visual perception tasks, but also exhibits more pronounced advantages over established benchmarks as the image resolution increases. Source code is available at https://github.com/MzeroMiko/VMamba.
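For intuition, the cross-scan idea from the abstract can be sketched as follows: a 2D feature map is unfolded along four traversal orders (row-major, column-major, and the reverses of both) so that a causal 1D state space scan can cover the spatial domain from each direction. This is a minimal illustrative sketch, not the optimized implementation in the VMamba repository; the function name `cross_scan`, the tensor shapes, and the use of plain PyTorch ops are assumptions made here.

```python
import torch

def cross_scan(x: torch.Tensor) -> torch.Tensor:
    """Unfold a (B, C, H, W) feature map into four 1D patch sequences.

    Returns a (B, 4, C, H*W) tensor holding the row-major traversal,
    the column-major traversal, and the reverses of both, so a causal
    1D scan can sweep the spatial domain from four directions.
    """
    B, C, H, W = x.shape
    row_major = x.flatten(2)                  # (B, C, H*W): left-to-right, top-to-bottom
    col_major = x.transpose(2, 3).flatten(2)  # (B, C, H*W): top-to-bottom, left-to-right
    forward = torch.stack([row_major, col_major], dim=1)  # (B, 2, C, H*W)
    return torch.cat([forward, forward.flip(-1)], dim=1)  # (B, 4, C, H*W)

# Example: a 96-channel 14x14 feature map becomes four length-196 sequences.
seqs = cross_scan(torch.randn(1, 96, 14, 14))
print(seqs.shape)  # torch.Size([1, 4, 96, 196])
```

Merging the four scanned outputs back onto the 2D grid (the inverse of this unfolding) restores a non-causal global receptive field while each individual scan stays linear in the number of patches.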