The paper introduces MambaVLA, a Vision-Language-Action (VLA) framework leveraging the Mamba state space architecture to address scalability and efficiency limitations of Transformer-based VLA models. MambaVLA integrates an Eagle visual encoder and a Qwen-7B-Chat-Int4 language model, achieving efficient multimodal fusion with linear-time complexity. By incorporating a diffusion flow matching module to align visual-language embeddings with continuous action trajectories, MambaVLA achieves comparable or superior performance to Transformers on VLA benchmarks with significantly reduced computational cost and faster inference.
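The linear-time claim comes from the state-space recurrence at the heart of Mamba: each timestep updates a fixed-size hidden state, so cost grows linearly with sequence length rather than quadratically as in self-attention. The sketch below is a minimal diagonal SSM scan for illustration only; it omits Mamba's input-dependent (selective) parameters and hardware-aware kernel, and all names and shapes are assumptions, not the paper's implementation.

```python
import numpy as np

def linear_ssm_scan(x, A, B, C):
    # Illustrative diagonal state-space scan (not the actual Mamba kernel):
    #   h_t = A * h_{t-1} + B * x_t    (elementwise state update)
    #   y_t = <C, h_t>                 (linear readout)
    # Each step costs O(n) for state size n, so a length-T sequence
    # costs O(T * n) -- linear in T, unlike attention's O(T^2).
    h = np.zeros_like(A)
    ys = []
    for x_t in x:
        h = A * h + B * x_t           # state update
        ys.append(float(np.dot(C, h)))  # readout
    return np.array(ys)

# Toy usage with hypothetical parameters.
rng = np.random.default_rng(0)
x = rng.standard_normal(16)   # length-16 scalar input sequence
A = np.full(4, 0.9)           # stable decay per state channel
B = np.ones(4)
C = np.ones(4) / 4
y = linear_ssm_scan(x, A, B, C)
print(y.shape)
```

Because the state `h` has fixed size, memory during inference is constant in sequence length, which is the efficiency property the summary attributes to the Mamba backbone.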
Ditch the Transformers: MambaVLA proves state space architectures can match or beat them in vision-language-action tasks, but with a fraction of the compute.
Recent advances in multimodal learning have enabled powerful Vision–Language–Action (VLA) systems for robotic reasoning and control. However, most existing approaches rely on Transformer backbones, which face scalability and efficiency bottlenecks on long sequences. This work introduces MambaVLA, a scalable VLA framework built on the Mamba state space architecture for efficient sequence modeling. The framework integrates the Eagle visual encoder and the Qwen-7B-Chat-Int4 language model to achieve fine-grained multimodal fusion with linear-time complexity. A diffusion flow matching module further aligns visual–language embeddings with continuous action trajectories, enabling smooth and precise control. Extensive evaluations on standard VLA benchmarks demonstrate that MambaVLA matches or surpasses Transformer-based models while offering substantially lower computational cost and faster inference. These results highlight the potential of state space modeling and flow-based action generation for compact, scalable, and deployable embodied intelligence systems.

Project page: https://sainavaneet.github.io/MambaVLA.gihub.io/
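The abstract's "flow matching module" refers to a family of training objectives that regress a velocity field carrying noise samples to data (here, action trajectories). A minimal sketch of the standard conditional flow matching loss with a straight-line probability path is below; the variable names and the straight-path choice are assumptions for illustration, not necessarily the paper's exact formulation.

```python
import numpy as np

def flow_matching_loss(x0, x1, v_pred):
    # Conditional flow matching with a straight-line path:
    #   x_t = (1 - t) * x0 + t * x1
    # Along this path the target velocity is constant: x1 - x0.
    # The loss is the MSE between the model's predicted velocity
    # at x_t and that target.
    target = x1 - x0
    return float(np.mean((v_pred - target) ** 2))

# Toy usage: x0 is a noise sample, x1 a (hypothetical) action trajectory.
rng = np.random.default_rng(1)
noise = rng.standard_normal(8)
action = rng.standard_normal(8)
t = 0.3
x_t = (1 - t) * noise + t * action   # point fed to the velocity network
perfect_pred = action - noise        # an exactly correct prediction
loss = flow_matching_loss(noise, action, perfect_pred)
print(loss)  # 0.0 for a perfect prediction
```

At inference time, actions would be generated by integrating the learned velocity field from noise toward the data distribution, which is what lets the model output smooth continuous trajectories rather than discretized action tokens.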