CMU ML · TRI · Apr 21, 2026 · arXiv:2604.19728

VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

Jean Mercat, Sedrick Keh, Kushal Arora, Isabella Huang, Paarth Shah, Haruki Nishimura, Shun Iwase, Katherine Liu

AI Summary

VLA Foundry is an open-source framework that unifies LLM, VLM, and VLA training, addressing the fragmentation of existing open-source VLA efforts. It supports both from-scratch training and pretrained backbones such as Qwen3-VL, with end-to-end control over the training pipeline. On LBM Eval, the model trained from scratch performs on par with the authors' prior closed-source work, and the model built on the Qwen3-VL backbone outperforms the baseline by a wide margin on multi-task tabletop manipulation.

Key Contribution

End-to-end training of Vision-Language-Action models just got a whole lot easier: VLA Foundry unifies LLM, VLM, and VLA training in a single open-source framework.

Abstract

We present VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. Most open-source VLA efforts specialize in the action training stage, often stitching together incompatible pretraining pipelines. VLA Foundry instead provides a shared training stack with end-to-end control, from language pretraining to action-expert fine-tuning. VLA Foundry supports both from-scratch training and pretrained backbones from Hugging Face. To demonstrate the utility of our framework, we train and release two types of models: the first trained fully from scratch through our LLM → VLM → VLA pipeline, and the second built on the pretrained Qwen3-VL backbone. We evaluate the closed-loop policy performance of both models on LBM Eval, an open-data, open-source simulator. We also contribute usability improvements to the simulator and the STEP analysis tools for easier public use. In the nominal evaluation setting, our fully open from-scratch model is on par with our prior closed-source work, and substituting in the Qwen3-VL backbone leads to a strong multi-task tabletop manipulation policy that outperforms our baseline by a wide margin. The VLA Foundry codebase is available at https://github.com/TRI-ML/vla_foundry and all multi-task model weights are released on https://huggingface.co/collections/TRI-ML/vla-foundry. Additional qualitative videos are available on the project website https://tri-ml.github.io/vla_foundry.

Multimodal Models · Open-Source Models & Weights · Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 72

Year: 2026

Venue: N/A
