Mar 9, 2026 | arXiv:2603.08113

SAMoE-VLA: A Scene Adaptive Mixture-of-Experts Vision-Language-Action Model for Autonomous Driving

Zihan You, Hongwei Liu, Chenxu Dang, Zhe Wang, Sining Ang, Aoqi Wang, Yan Wang

AI Summary

The paper introduces SAMoE-VLA, a scene-adaptive Vision-Language-Action model for autonomous driving that addresses the instability and safety degradation observed when directly applying token-level Mixture-of-Experts (MoE) from LLMs to VLA models. SAMoE-VLA conditions expert selection on structured scene representations derived from bird's-eye-view (BEV) features, enabling scenario-dependent expert weighting. The model also incorporates a Conditional Cross-Modal Causal Attention mechanism for temporally consistent reasoning across modalities. Experiments on nuScenes and LangAuto benchmarks demonstrate state-of-the-art performance with fewer parameters compared to existing VLA and world-model-based approaches.

Key Contribution

Token-level Mixture-of-Experts, directly ported from LLMs, can actually *hurt* autonomous driving performance in VLA models; SAMoE-VLA fixes this with scene-adaptive expert selection, achieving SOTA results with fewer parameters.

Abstract

Recent advances in Vision-Language-Action (VLA) models have shown promising capabilities in autonomous driving by leveraging the understanding and reasoning strengths of Large Language Models (LLMs). However, our empirical analysis reveals that directly applying existing token-level Mixture-of-Experts (MoE) mechanisms, which are inherited from LLM architectures, to VLA models results in unstable performance and safety degradation in autonomous driving, highlighting a misalignment between token-based expert specialization and scene-level decision-making. To address this, we propose SAMoE-VLA, a scene-adaptive Vision-Language-Action framework that conditions expert selection on structured scene representations instead of token embeddings. Our key idea is to derive the MoE routing signal from bird's-eye-view (BEV) features that encapsulate traffic scene context, enabling scenario-dependent expert weighting and merging tailored to distinct driving conditions. Furthermore, to support temporally consistent reasoning across world knowledge, perception, language, and action, we introduce a Conditional Cross-Modal Causal Attention mechanism that integrates world state, linguistic intent, and action history into a unified causal reasoning process. Extensive experiments on the nuScenes open-loop planning dataset and the LangAuto closed-loop benchmark demonstrate that SAMoE-VLA achieves state-of-the-art performance, outperforming prior VLA-based and world-model-based approaches with fewer parameters. Our code will be released soon.
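
The contrast between token-level and scene-level routing can be illustrated with a small sketch. The abstract does not specify how the BEV features are pooled, what form the router takes, or how experts are merged, so everything below is an assumption made for illustration: mean pooling over BEV cells, a linear router (`router`), linear experts, and merging by a gate-weighted sum of expert weight matrices. All function and variable names are hypothetical and are not taken from the authors' code.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    # Multiply a (rows x cols) matrix by a vector of length cols.
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def mean_pool(bev):
    # Assumption: summarize the BEV map (a list of cell feature vectors)
    # as one scene vector by averaging over cells.
    dim = len(bev[0])
    return [sum(cell[d] for cell in bev) / len(bev) for d in range(dim)]

def scene_gate(bev, router):
    # One gate distribution per scene, not per token.
    # `router` is a hypothetical (num_experts x dim) linear layer.
    return softmax(matvec(router, mean_pool(bev)))

def merge_experts(experts, gate):
    # Assumption: experts are linear maps, merged by a gate-weighted sum.
    rows, cols = len(experts[0]), len(experts[0][0])
    return [
        [sum(g * e[r][c] for g, e in zip(gate, experts)) for c in range(cols)]
        for r in range(rows)
    ]

def scene_moe(tokens, bev, router, experts):
    # Every token in the scene passes through the same merged expert.
    merged = merge_experts(experts, scene_gate(bev, router))
    return [matvec(merged, t) for t in tokens]

# Toy data: two BEV cells, two experts (identity and 2x identity), two tokens.
bev = [[1.0, 0.0], [1.0, 0.0]]
router = [[2.0, 0.0], [0.0, 2.0]]
experts = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
]
tokens = [[1.0, 1.0], [0.5, -0.5]]

gate = scene_gate(bev, router)
outputs = scene_moe(tokens, bev, router, experts)
```

In this toy setup the gate depends only on `bev`, so changing `tokens` never changes which experts are favored. A token-level router would instead compute a separate gate from each token embedding, which is the design the abstract reports as unstable for driving.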

Architecture Design (Transformers, SSMs, MoE) | Multimodal Models | Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
