$$
J_{\mathrm{GRPO}}(\theta)=\mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{T_{i}^{\mathrm{valid}}}\sum_{t=1}^{T_{i}^{\mathrm{valid}}}\min\!\left(\rho_{t}^{(i)}(\theta)\hat{A}^{(i)},\ \mathrm{clip}\!\left(\rho_{t}^{(i)}(\theta),1-\epsilon,1+\epsilon\right)\hat{A}^{(i)}\right)\right]. \tag{11}
$$

This objective also complements KIR: keyframe-initialized rollouts tend to reach task resolution with fewer valid steps, and trajectory-length normalization increases their per-timestep contribution, so gradients are dominated by short, task-critical segments rather than long, drift-prone continuations. A minimal implementation sketch of this length-normalized objective is provided after the experimental questions below.

4.3 PACE: Policy-Aligned Co-Evolution

While policy optimization proceeds entirely within the learned world model, the policy's action distribution continuously evolves and drifts away from the data used to train the initial world model. This inherent distribution shift leads to an accumulating mismatch between the simulator and the improving policy, ultimately degrading the reliability of imagined rollouts. To address this issue, we introduce PACE, a world model–policy co-evolution strategy. Instead of treating the world model as a fixed, static simulator throughout policy optimization, PACE allows the world model and the VLA policy to evolve together throughout training.

Concretely, we realize this co-evolution through low-frequency, policy-driven refinement: we first train an initial world model, denoted $\mathrm{WM}_{\mathrm{Base}}$, using trajectories collected from the base VLA policy. After the first stage of policy optimization within $\mathrm{WM}_{\mathrm{Base}}$, we collect a limited set of additional rollouts under the evolved policy and use them to further refine the world model. The refined model is referred to as $\mathrm{WM}_{\mathrm{Evo}}$. Importantly, this refinement is performed only once (or at very low frequency), distinguishing PACE from classical model-based reinforcement learning methods, which continuously update the dynamics model at high frequency during policy optimization.

This low-frequency refinement provides two key advantages. First, unlike real-world online RL, it does not require continuous human supervision or environment resets during policy training, significantly reducing operational overhead. Second, by aligning the world model with the evolving policy distribution, PACE mitigates compounding model errors and maintains simulator reliability without sacrificing training stability.

System Implementation. We build WoVR on top of RLinf [49] to support efficient distributed imagined rollouts and training. Concretely, we replace RLinf's environment back-end with our learned world model, enabling scalable closed-loop rollouts without a ground-truth simulator. GPU allocation details are provided in Appendix A.

5 Experiments

We conduct extensive experiments to evaluate the effectiveness of WoVR as a world-model-based reinforcement learning framework for post-training VLA policies. Our experimental design aims to systematically answer the following three questions:

• Q1: Is the proposed world model stable, controllable, and efficient enough to serve as a simulator for closed-loop reinforcement learning?
• Q2: Can WoVR effectively improve VLA task performance compared to existing world-model-based reinforcement learning methods?
• Q3: Do the policies optimized with WoVR reliably transfer to real-world robotic manipulation tasks?
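As a concrete reference for the objective in Eq. (11), the following PyTorch-style sketch computes the length-normalized clipped surrogate for a group of imagined rollouts. The tensor names, shapes, and the use of a single group-relative advantage per rollout are illustrative assumptions, not the actual WoVR implementation.

```python
# Minimal sketch of the length-normalized clipped objective in Eq. (11).
# Tensor names and shapes are illustrative assumptions.
import torch

def grpo_objective(log_probs, old_log_probs, advantages, valid_mask, eps=0.2):
    """
    log_probs, old_log_probs: (G, T) per-timestep action log-probabilities under
                              the current and behavior policies.
    advantages:               (G,)   one group-relative advantage per rollout.
    valid_mask:               (G, T) 1.0 for valid (pre-termination) steps, 0.0 after.
    Returns the scalar objective to be maximized.
    """
    ratio = torch.exp(log_probs - old_log_probs)              # rho_t^(i)(theta)
    adv = advantages.unsqueeze(-1)                             # broadcast A^(i) over timesteps
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    per_step = torch.min(unclipped, clipped) * valid_mask
    # Normalize each trajectory by its own number of valid steps (T_i^valid),
    # so short, keyframe-initialized rollouts contribute more per timestep.
    per_traj = per_step.sum(dim=-1) / valid_mask.sum(dim=-1).clamp(min=1.0)
    return per_traj.mean()                                     # average over the group of G rollouts
```

The division by each trajectory's own number of valid steps is what raises the per-timestep weight of short, keyframe-initialized rollouts, matching the discussion above.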
To answer these questions, we evaluate both the quality of the learned world model and the downstream policy performance. For world model evaluation, we focus on long-horizon, action-conditioned video generation under closed-loop, chunk-by-chunk autoregressive inference. We adopt standard perceptual and distributional metrics, including LPIPS [53], FID [12], FVD [42], and FloLPIPS [7]. Specifically, LPIPS (Learned Perceptual Image Patch Similarity) measures frame-level perceptual similarity using deep feature representations; FID (Fréchet Inception Distance) evaluates distributional similarity between generated and real frames via feature statistics; FVD (Fréchet Video Distance) extends this comparison to the temporal domain to assess video-level realism and motion consistency; and FloLPIPS measures motion-aligned perceptual similarity along estimated optical flow trajectories, emphasizing temporal coherence under action-conditioned dynamics (a minimal sketch of the LPIPS computation is provided after the experimental setup below). Since the world model is intended to be used as a simulator for closed-loop on-policy reinforcement learning, we also report inference throughput (frames per second) to quantify generation efficiency. For policy evaluation, we use task success rate (SR) as the primary metric, reflecting the sparse-reward setting commonly encountered in real-world robotic manipulation. All success rates are computed over multiple independent rollouts with fixed initial conditions.

We compare WoVR against several representative baselines spanning both world model quality and policy optimization. For world model quality, we include EVAC [15], which conditions generation on absolute end-effector actions, as well as Cosmos-Predict2 [35] and OpenSora [34], the latter serving as the world-model backbone adopted in WMPO [57]. All compared models are evaluated under the same chunk-wise autoregressive generation protocol to ensure a fair comparison. For policy optimization, we consider the following baselines:

• OpenVLA-OFT-base [17]: a base VLA policy trained purely with imitation learning;
• GRPO (Online) [9]: trained with real-environment interaction under the same rollout budget;
• WMPO [57]: performs reinforcement learning using an OpenSora-based world model.

All world-model-based methods are trained until convergence within their respective simulators, while GRPO is reported under the same rollout budget to ensure a fair comparison. All experiments are conducted on eight NVIDIA H100 GPUs.

5.1 Q1: Is the World Model Stable, Controllable, and Efficient?

We first investigate whether the proposed world model is sufficiently stable, controllable, and efficient to serve as a simulator for closed-loop reinforcement learning. In particular, we focus on long-horizon, action-conditioned video generation under chunk-by-chunk autoregressive inference, where modeling errors may accumulate and severely affect downstream policy optimization.

Experimental Setup. We conduct all world model evaluations in the LIBERO environment [26]. A total of 3,000 VLA rollout trajectories, each with a length of 512 frames, are collected to train the world models. In addition, 200 held-out trajectories of the same length are used exclusively for evaluation. We compare WoVR against three representative action-conditioned world models: EVAC, Cosmos-Predict2, and OpenSora as adopted in WMPO. Among them, EVAC conditions video generation on absolute end-effector actions, while Cosmos-Predict2, OpenSora, and WoVR all use residual action representations.
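For reference, the snippet below illustrates how the frame-level LPIPS score can be computed for one generated rollout against its ground-truth trajectory using the public lpips package; the data layout and scaling convention are assumptions for illustration, not the exact evaluation code.

```python
# Minimal sketch of frame-level LPIPS evaluation for one rollout, assuming frames
# are already aligned and scaled to [-1, 1]; not the exact WoVR evaluation code.
import torch
import lpips  # pip install lpips

lpips_fn = lpips.LPIPS(net='alex')  # deep-feature perceptual distance

@torch.no_grad()
def rollout_lpips(pred_video: torch.Tensor, gt_video: torch.Tensor) -> float:
    """pred_video, gt_video: (T, 3, H, W) tensors in [-1, 1]; returns the mean
    perceptual distance over all T frames (lower is better)."""
    dists = lpips_fn(pred_video, gt_video)  # (T, 1, 1, 1) per-frame distances
    return dists.mean().item()
```

FID and FVD, in contrast, compare feature statistics across the whole set of generated and held-out frames or clips rather than averaging per-frame distances.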
During evaluation, all baseline models follow the same chunk-wise autoregressive generation protocol to ensure a fair comparison. Specifically, each model predicts future video segments by conditioning on a 4-frame visual context together with an 8-step action chunk, and autoregressively generates the subsequent 8 frames. For the first chunk, where only a single initial image is available, the initial frame is replicated to fill the context window in order to align the inference procedure across methods (a schematic sketch of this rollout loop is provided at the end of this subsection). We quantitatively evaluate the generated rollouts by comparing predicted videos with ground-truth trajectories using standard video generation metrics, including LPIPS, FID, FVD, and FloLPIPS.

Table 1: World model quality, motion consistency, and efficiency comparison. Rollout denotes the rollout horizon length.

| Method | Rollout | FPS ↑ | LPIPS [53] ↓ | FID [12] ↓ | FVD [42] ↓ | FloLPIPS [7] ↓ |
|---|---|---|---|---|---|---|
| EVAC [15] | 512 | 2.7 | 0.146 | 46.528 | 345.818 | 0.205 |
| | 256 | | 0.130 | 49.153 | 354.983 | 0.192 |
| | 128 | | 0.106 | 44.337 | 423.132 | 0.166 |
| Cosmos-Predict2 [35] | 512 | 3.50 | 0.315 | 165.862 | 275.737 | 0.265 |
| | 256 | | 0.226 | 106.324 | 203.853 | 0.306 |
| | 128 | | 0.164 | 77.555 | 304.456 | 0.281 |
| OpenSora [57] | 512 | 7.00 | 0.105 | 38.478 | 89.391 | 0.156 |
| | 256 | | 0.082 | 33.577 | 94.998 | 0.122 |
| | 128 | | 0.069 | 33.413 | 111.643 | 0.113 |
| WoVR (Ours) | 512 | 23.0 | 0.091 | 34.252 | 68.011 | 0.154 |
| | 256 | | 0.063 | 24.378 | 50.041 | 0.102 |
| | 128 | | 0.047 | 18.553 | 39.047 | 0.079 |

Quantitative Results. Table 1 summarizes the quantitative comparison across different rollout horizons. As shown in the table, WoVR consistently outperforms EVAC, Cosmos-Predict2, and OpenSora across all evaluation metrics. In particular, WoVR achieves the lowest LPIPS, FID, FVD, and FloLPIPS scores at all tested rollout lengths, indicating higher visual fidelity, stronger temporal consistency, and more accurate action-conditioned dynamics. These improvements become more pronounced as the rollout horizon increases, suggesting that WoVR is more robust to error accumulation in long-horizon autoregressive generation. Despite adopting a larger backbone (Wan), WoVR also attains the highest inference throughput among all compared models (23.0 FPS), indicating that it is efficient enough to serve as a simulator for closed-loop reinforcement learning.
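To make the generation protocol above concrete, the sketch below shows the chunk-wise autoregressive rollout loop (4-frame context, 8-step action chunks, initial-frame replication for the first chunk). The world_model.predict_chunk interface and tensor shapes are hypothetical placeholders, not the actual WoVR API.

```python
# Schematic sketch of the chunk-wise autoregressive rollout protocol described above;
# `world_model.predict_chunk` is a hypothetical interface used only for illustration.
import torch

CONTEXT_LEN, CHUNK_LEN = 4, 8

def autoregressive_rollout(world_model, init_frame, action_chunks):
    """
    init_frame:    (3, H, W) single initial observation.
    action_chunks: iterable of (CHUNK_LEN, action_dim) tensors, one per chunk.
    Returns the generated video as a (T, 3, H, W) tensor.
    """
    # First chunk: replicate the initial frame to fill the 4-frame context window.
    context = init_frame.unsqueeze(0).repeat(CONTEXT_LEN, 1, 1, 1)
    frames = []
    for actions in action_chunks:
        # Predict the next 8 frames conditioned on the visual context and action chunk.
        next_frames = world_model.predict_chunk(context, actions)  # (CHUNK_LEN, 3, H, W)
        frames.append(next_frames)
        # Slide the context window to the most recent 4 generated frames.
        context = next_frames[-CONTEXT_LEN:]
    return torch.cat(frames, dim=0)
```

Under this protocol, a 512-frame rollout corresponds to 64 such chunks, which is where long-horizon error accumulation becomes visible in Table 1.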