The paper introduces MambaVLA, a Vision-Language-Action (VLA) framework leveraging the Mamba state space architecture to address scalability and efficiency limitations of Transformer-based VLA models. MambaVLA integrates an Eagle visual encoder and a Qwen-7B-Chat-Int4 language model, achieving efficient multimodal fusion with linear-time complexity. By incorporating a diffusion flow matching module to align visual-language embeddings with continuous action trajectories, MambaVLA achieves comparable or superior performance to Transformers on VLA benchmarks with significantly reduced computational cost and faster inference.
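The linear-time claim comes from the state-space recurrence at the heart of Mamba: each timestep updates a fixed-size hidden state, so cost grows linearly with sequence length rather than quadratically as in self-attention. The sketch below is a minimal diagonal SSM scan for illustration only; it omits Mamba's input-dependent (selective) parameters and hardware-aware kernel, and all names and shapes are assumptions, not the paper's implementation.

```python
import numpy as np

def linear_ssm_scan(x, A, B, C):
    # Illustrative diagonal state-space scan (not the actual Mamba kernel):
    #   h_t = A * h_{t-1} + B * x_t    (elementwise state update)
    #   y_t = <C, h_t>                 (linear readout)
    # Each step costs O(n) for state size n, so a length-T sequence
    # costs O(T * n) -- linear in T, unlike attention's O(T^2).
    h = np.zeros_like(A)
    ys = []
    for x_t in x:
        h = A * h + B * x_t           # state update
        ys.append(float(np.dot(C, h)))  # readout
    return np.array(ys)

# Toy usage with hypothetical parameters.
rng = np.random.default_rng(0)
x = rng.standard_normal(16)   # length-16 scalar input sequence
A = np.full(4, 0.9)           # stable decay per state channel
B = np.ones(4)
C = np.ones(4) / 4
y = linear_ssm_scan(x, A, B, C)
print(y.shape)
```

Because the state `h` has fixed size, memory during inference is constant in sequence length, which is the efficiency property the summary attributes to the Mamba backbone.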
Ditch the Transformers: MambaVLA proves state space architectures can match or beat them in vision-language-action tasks, but with a fraction of the compute.
Recent advances in multimodal learning have enabled powerful Vision–Language–Action (VLA) systems for robotic reasoning and control. However, most existing approaches rely on Transformer backbones, which face scalability and efficiency bottlenecks on long sequences. This work introduces MambaVLA, a scalable VLA framework built on the Mamba state space architecture for efficient sequence modeling. The framework integrates the Eagle visual encoder and the Qwen-7B-Chat-Int4 language model to achieve fine-grained multimodal fusion with linear-time complexity. A diffusion flow matching module further aligns visual–language embeddings with continuous action trajectories, enabling smooth and precise control. Extensive evaluations on standard VLA benchmarks demonstrate that MambaVLA matches or surpasses Transformer-based models while offering substantially lower computational cost and faster inference. These results highlight the potential of state space modeling and flow-based action generation for compact, scalable, and deployable embodied intelligence systems.

Project page: https://sainavaneet.github.io/MambaVLA.gihub.io/
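The abstract's "flow matching module" refers to a family of training objectives that regress a velocity field carrying noise samples to data (here, action trajectories). A minimal sketch of the standard conditional flow matching loss with a straight-line probability path is below; the variable names and the straight-path choice are assumptions for illustration, not necessarily the paper's exact formulation.

```python
import numpy as np

def flow_matching_loss(x0, x1, v_pred):
    # Conditional flow matching with a straight-line path:
    #   x_t = (1 - t) * x0 + t * x1
    # Along this path the target velocity is constant: x1 - x0.
    # The loss is the MSE between the model's predicted velocity
    # at x_t and that target.
    target = x1 - x0
    return float(np.mean((v_pred - target) ** 2))

# Toy usage: x0 is a noise sample, x1 a (hypothetical) action trajectory.
rng = np.random.default_rng(1)
noise = rng.standard_normal(8)
action = rng.standard_normal(8)
t = 0.3
x_t = (1 - t) * noise + t * action   # point fed to the velocity network
perfect_pred = action - noise        # an exactly correct prediction
loss = flow_matching_loss(noise, action, perfect_pred)
print(loss)  # 0.0 for a perfect prediction
```

At inference time, actions would be generated by integrating the learned velocity field from noise toward the data distribution, which is what lets the model output smooth continuous trajectories rather than discretized action tokens.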