Mistral | Jun 4, 2026 | arXiv:2606.05773

PiL-World: A Chunk-Wise World Model for VLA Policy-in-the-Loop Evaluation

Chong Ma, Taiyi Su, Jian Zhu, Jianjun Zhang, Zitai Huang, Yi Xu, Hanli Wang

AI Summary

This paper introduces PiL-World, a chunk-wise world model for closed-loop evaluation of vision-language-action (VLA) policies on robotic tasks. Given the current observation and the action chunk produced by the policy, PiL-World generates multi-view future observations, which the policy then uses for its next decision. Open-loop world models that only predict along pre-collected action trajectories cannot support this kind of evaluation. On three real dual-arm manipulation tasks, PiL-World reduces the error between real-world success rates and those estimated through closed-loop world-model evaluation from 63.2% for the baseline to 12.0%.

Key Contribution

PiL-World reduces the error between real-world VLA success rates and those estimated through closed-loop world-model evaluation from 63.2% to 12.0%, a drop of 51.2 percentage points relative to the baseline.
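The size of the reported improvement can be read in two ways, as an absolute drop in percentage points or as a relative reduction of the baseline error. The short calculation below shows both. It uses only the two figures quoted in the abstract (63.2% and 12.0%); the variable names are illustrative, and how the paper aggregates the error across tasks is not stated here, so the numbers are treated as given.

```python
# Success-rate estimation error reported in the abstract (percent).
baseline_error = 63.2   # baseline world model
pil_world_error = 12.0  # PiL-World

# Absolute reduction, in percentage points.
absolute_drop = round(baseline_error - pil_world_error, 1)

# Relative reduction, as a percentage of the baseline error.
relative_drop = round(100 * (baseline_error - pil_world_error) / baseline_error, 1)

print(f"absolute drop: {absolute_drop} points")   # absolute drop: 51.2 points
print(f"relative drop: {relative_drop}%")         # relative drop: 81.0%
```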

Abstract

Vision-language-action (VLA) policies operate in a closed loop in real-world robot tasks: a robot observes the scene, executes an action chunk, and conditions its next decision on the resulting observation. However, most existing world models for robot action evaluation are limited to open-loop prediction along pre-collected action trajectories. This prevents them from supporting closed-loop VLA evaluation, where each action chunk must be conditioned on the observation generated by the previous execution. To address this gap, we propose PiL-World, a chunk-wise world model designed for policy-in-the-loop VLA evaluation. Given the current observation and the action trajectory rolled out by a VLA policy, PiL-World generates multi-view future observations that are consistent with the VLA rollout and match the image inputs required by the policy. By alternating between VLA inference and world-model prediction, PiL-World enables closed-loop evaluation without real robot execution at every step. To improve rollout fidelity, PiL-World conditions video generation on action-derived visual control from head-view robot motion and latent histories that encode task execution context, while jointly predicting complementary multi-view observations. Beyond successful teleoperated demonstrations, it also learns from failed execution trajectories, helping the imagined rollouts better match the distribution of real policy executions. We evaluate PiL-World on three real dual-arm manipulation tasks. PiL-World generates imagined rollouts that are highly consistent with real robot executions. More importantly, compared with the baseline, it reduces the error between VLA success rates measured in real-world rollouts and those estimated through closed-loop world-model evaluation from 63.2% to 12.0%.
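The alternation between VLA inference and world-model prediction described in the abstract can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_policy`, `toy_world_model`, `toy_success`, and the one-dimensional "observation" are hypothetical stand-ins for a real VLA policy, a learned video world model, and multi-view images.

```python
from typing import Callable, List, Tuple

Observation = List[float]   # hypothetical stand-in for multi-view images
ActionChunk = List[float]   # hypothetical stand-in for a chunk of robot actions


def closed_loop_rollout(
    policy: Callable[[Observation], ActionChunk],
    world_model: Callable[[Observation, ActionChunk], Observation],
    success_fn: Callable[[Observation], bool],
    initial_obs: Observation,
    max_chunks: int,
) -> Tuple[bool, List[Observation]]:
    """Alternate policy inference and world-model prediction, chunk by chunk."""
    obs = initial_obs
    history = [obs]
    for _ in range(max_chunks):
        chunk = policy(obs)              # policy acts on the latest (imagined) observation
        obs = world_model(obs, chunk)    # world model imagines the resulting observation
        history.append(obs)
        if success_fn(obs):
            return True, history
    return False, history


def estimate_success_rate(initial_states: List[Observation], max_chunks: int) -> float:
    """Fraction of imagined rollouts judged successful (toy components below)."""
    outcomes = [
        closed_loop_rollout(toy_policy, toy_world_model, toy_success, s, max_chunks)[0]
        for s in initial_states
    ]
    return sum(outcomes) / len(outcomes)


GOAL = 1.0  # toy target position


def toy_policy(obs: Observation) -> ActionChunk:
    # Emit a chunk of 4 equal steps toward the goal, each clipped to +/- 0.1.
    step = max(-0.1, min(0.1, (GOAL - obs[0]) / 4))
    return [step] * 4


def toy_world_model(obs: Observation, chunk: ActionChunk) -> Observation:
    # Toy dynamics: the next observation is the position after applying the chunk.
    return [obs[0] + sum(chunk)]


def toy_success(obs: Observation) -> bool:
    return abs(obs[0] - GOAL) < 0.05


success, history = closed_loop_rollout(
    toy_policy, toy_world_model, toy_success, [0.0], max_chunks=5
)
rate = estimate_success_rate([[0.0], [0.5], [-3.0]], max_chunks=5)
```

The point of the sketch is the data flow: each action chunk depends on the observation produced by the previous chunk, which an open-loop model that replays a fixed, pre-collected action trajectory cannot provide. In a real setting the success check would also need its own definition, which the abstract does not specify.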

Multimodal Models | Robotics & Embodied AI | World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
