Jun 10, 2026 · arXiv:2606.12403

World Pilot: Steering Vision-Language-Action Models with World-Action Priors

Zefu Lin, Rongxu Cui, Junjia Xu, Xiaojuan Jin, Wenling Li, Lue Fan, Zhaoxiang Zhang

AI Summary

This paper introduces World Pilot, a Vision-Language-Action (VLA) framework that improves robotic manipulation by routing priors from a World-Action Model (WAM) into the policy's decision chain. It uses two complementary pathways: Latent Steering conditions the perception layer on a scene-evolution latent, and Action Steering supplies an anticipated trajectory as a motion prior to the action generator. World Pilot attains a state-of-the-art Total success rate of 84.7% on the LIBERO-Plus zero-shot out-of-distribution (OOD) benchmark and the highest success rate on every real-robot setting across four manipulation tasks, with the largest margins under shifts in viewpoint, geometry, deformable state, and pose.

Key Contribution

World Pilot integrates anticipatory scene and motion priors from a World-Action Model into a VLA policy and attains a state-of-the-art Total success rate of 84.7% on the LIBERO-Plus zero-shot OOD benchmark.

Abstract

Vision-Language-Action (VLA) models inherit semantic grounding from large-scale pretraining and perform competently across in-distribution manipulation tasks. This grounding, however, is built on static image-text pairs, whereas manipulation is a continuous, contact-rich process whose dynamics this pretraining cannot capture. We present World Pilot, a VLA framework that augments the policy with priors from a World-Action Model (WAM), routed into the decision chain through two complementary pathways. Latent Steering conditions the perception layer on a scene-evolution latent, and Action Steering supplies an anticipated trajectory as a motion prior to the action generator. Together the two priors equip the VLA with an anticipated view of the scene and a trajectory-level motion hint alongside its semantic conditioning, and the scene-evolution prior remains effective even when supplied by a video-pretrained world model that has not been action-post-trained. World Pilot attains a state-of-the-art Total success rate of 84.7% on the LIBERO-Plus zero-shot OOD benchmark and the highest success rate on every real-robot setting across four manipulation tasks, with the largest margins under shifts in viewpoint, geometry, deformable state, and pose. Project Website: https://world-pilot.github.io/
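The two pathways described in the abstract can be pictured with a toy data-flow sketch. This is not the authors' implementation: every name below (`WorldActionPriors`, `latent_steering`, `toy_action_generator`, `world_pilot_step`, the `gate` parameter) is hypothetical, and both the gated additive conditioning and the residual-offset generator are stand-ins chosen only to show where each prior enters the decision chain.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]
Trajectory = List[Vector]


@dataclass
class WorldActionPriors:
    """Hypothetical container for the two WAM outputs named in the abstract."""
    scene_latent: Vector          # scene-evolution latent, used by Latent Steering
    anticipated_traj: Trajectory  # anticipated trajectory, used by Action Steering


def latent_steering(visual_features: Vector, scene_latent: Vector,
                    gate: float = 0.5) -> Vector:
    """Condition perception features on the scene-evolution latent.

    Assumption: a gated additive mix. The abstract does not specify the
    conditioning mechanism.
    """
    if len(visual_features) != len(scene_latent):
        raise ValueError("feature and latent dimensions must match")
    return [v + gate * s for v, s in zip(visual_features, scene_latent)]


def toy_action_generator(features: Vector, motion_prior: Trajectory) -> Trajectory:
    """Toy stand-in for the action generator.

    It starts from the anticipated trajectory (the motion prior) and shifts
    every waypoint by the mean of the conditioned features. A real action
    generator would be a learned model.
    """
    residual = sum(features) / len(features)
    return [[x + residual for x in waypoint] for waypoint in motion_prior]


def world_pilot_step(visual_features: Vector, priors: WorldActionPriors,
                     action_generator: Callable[[Vector, Trajectory], Trajectory],
                     gate: float = 0.5) -> Trajectory:
    """Route both priors into one decision step."""
    # Pathway 1 (Latent Steering): the perception side sees the scene latent.
    conditioned = latent_steering(visual_features, priors.scene_latent, gate)
    # Pathway 2 (Action Steering): the generator receives the motion prior.
    return action_generator(conditioned, priors.anticipated_traj)


if __name__ == "__main__":
    priors = WorldActionPriors(scene_latent=[2.0, 4.0],
                               anticipated_traj=[[0.0, 1.0], [1.0, 2.0]])
    print(world_pilot_step([1.0, 2.0], priors, toy_action_generator))
```

The point of the sketch is only the routing: the scene latent touches the perception features, while the anticipated trajectory goes directly to the action generator, so either prior could in principle be swapped out or dropped independently.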

Multimodal Models · Robotics & Embodied AI · World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
