Tsinghua AID observations into | HKUST | SmartMore Ltd. | Apr 13, 2026 | arXiv:2604.11757

StarVLA-$\alpha$: Reducing Complexity in Vision-Language-Action Systems

Jinhui Ye, Ning Gao, Senqiao Yang, Jinliang Zheng, Zixuan Wang, Yuxin Chen, Pengguang Chen, Yilun Chen, Shu Liu, Jiaya Jia

AI Summary

StarVLA-$\alpha$ is introduced as a simplified VLA baseline to systematically study design choices in vision-language-action models. By minimizing architectural complexity and pipeline engineering, the study re-evaluates action modeling, robot-specific pretraining, and interface engineering across multiple benchmarks. The resulting generalist model achieves competitive performance, outperforming baselines like $\pi_{0.5}$ by 20% on the real-world RoboChallenge, suggesting that a strong VLM backbone is sufficient for strong performance without complex architectures.

Key Contribution

A surprisingly simple VLA model, StarVLA-$\alpha$, beats more complex systems on real-world robotics tasks, suggesting that VLM backbones are more critical than intricate architectures.

Abstract

Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for building general-purpose robotic agents. However, the VLA landscape remains highly fragmented and complex, as existing approaches vary substantially in architectures, training data, embodiment configurations, and benchmark-specific engineering. In this work, we introduce StarVLA-$\alpha$, a simple yet strong baseline designed to study VLA design choices under controlled conditions. StarVLA-$\alpha$ deliberately minimizes architectural and pipeline complexity to reduce experimental confounders and enable systematic analysis. Specifically, we re-evaluate several key design axes, including action modeling strategies, robot-specific pretraining, and interface engineering. Across unified multi-benchmark training on LIBERO, SimplerEnv, RoboTwin, and RoboCasa, the same simple baseline remains highly competitive, indicating that a strong VLM backbone combined with minimal design is already sufficient to achieve strong performance without relying on additional architectural complexity or engineering tricks. Notably, our single generalist model outperforms $\pi_{0.5}$ by 20% on the public real-world RoboChallenge benchmark. We expect StarVLA-$\alpha$ to serve as a solid starting point for future research in the VLA regime. Code will be released at https://github.com/starVLA/starVLA.

Architecture Design (Transformers, SSMs, MoE) | Multimodal Models | Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 68

Year: 2026

Venue: N/A
