This paper introduces VLAW, an iterative algorithm for co-improving vision-language-action (VLA) policies and action-conditioned video generation world models using real-world rollouts. VLAW leverages real-world data to refine the world model, which is then used to generate synthetic data for further policy improvement, addressing the limitations of world models trained solely on demonstration datasets. Experiments on a real robot demonstrate a 39.2% absolute improvement in success rate over the base policy, highlighting the effectiveness of the iterative co-improvement strategy.
Closing the reality gap: iteratively refining a world model with real-world robot data yields a significant boost in vision-language-action policy performance.
The goal of this paper is to improve the performance and reliability of vision-language-action (VLA) models through iterative online interaction. Since collecting policy rollouts in the real world is expensive, we investigate whether a learned simulator, specifically an action-conditioned video generation model, can be used to generate additional rollout data. Unfortunately, existing world models lack the physical fidelity necessary for policy improvement: they are predominantly trained on demonstration datasets that lack coverage of many distinct physical interactions (particularly failure cases), and they struggle to accurately model small yet critical physical details in contact-rich object manipulation. We propose a simple iterative improvement algorithm that uses real-world rollout data to improve the fidelity of the world model, which can then, in turn, be used to generate supplemental synthetic data for improving the VLA model. In experiments on a real robot, this approach improves the performance of a state-of-the-art VLA model on multiple downstream tasks: we achieve a 39.2% absolute success-rate improvement over the base policy and an 11.6% improvement from training with the generated synthetic rollouts. Videos can be found at this anonymous website: https://sites.google.com/view/vla-w
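The iterative co-improvement loop described above can be sketched as follows. This is a minimal illustration only: the class names, method signatures, and iteration counts are hypothetical stand-ins, not the paper's actual interface or training procedure.

```python
class StubPolicy:
    """Placeholder for a VLA policy (hypothetical interface)."""
    def __init__(self):
        self.updates = 0

    def finetune(self, rollouts):
        # Stand-in for fine-tuning the policy on rollout data.
        self.updates += 1


class StubWorldModel:
    """Placeholder for an action-conditioned video world model."""
    def __init__(self):
        self.updates = 0

    def finetune(self, rollouts):
        # Stand-in for refining world-model fidelity on real rollouts.
        self.updates += 1

    def rollout(self, policy):
        # Stand-in for simulating a policy rollout inside the world model.
        return "synthetic_rollout"


def collect_real_rollouts(policy, n=10):
    # Stand-in for expensive real-robot data collection; real rollouts
    # include failure cases that demonstration datasets lack.
    return ["real_rollout"] * n


def iterative_co_improvement(policy, world_model, collect_fn,
                             n_iterations=3, n_synthetic=5):
    """Alternate between (1) refining the world model on real rollouts
    and (2) improving the policy with supplemental synthetic rollouts."""
    for _ in range(n_iterations):
        real = collect_fn(policy)                 # real-world interaction
        world_model.finetune(real)                # improve model fidelity
        synthetic = [world_model.rollout(policy)  # generate synthetic data
                     for _ in range(n_synthetic)]
        policy.finetune(real + synthetic)         # improve the VLA policy
    return policy
```

The key design point the abstract emphasizes is the alternation: real rollouts first close the world model's fidelity gap, and only then is the refined model trusted to generate synthetic training data for the policy.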