5.2 Q2: Can WoVR Effectively Improve VLA Task Performance?

We next evaluate whether the proposed world model can effectively support reinforcement learning and improve the task performance of VLA policies. Beyond world model fidelity, this experiment directly assesses the practical value of WoVR as a simulator for policy optimization.

Experimental Setup. We conduct policy optimization experiments on multiple LIBERO task suites: Spatial, Object, Goal, and Long [26]. Following SimpleVLA-RL [20], we initialize the base policy from OpenVLA-OFT and perform one-trajectory supervised fine-tuning. Each LIBERO suite contains 10 tasks. For each suite, we allocate a total real-environment rollout budget of 2,500 trajectories. We first collect 1,500 trajectories (150 per task) with the base VLA policy to train the initial world model $\mathrm{WM}_{\mathrm{Base}}$. After the first stage of policy optimization in imagination, we collect an additional 1,000 trajectories under the updated policy and use them to refine the world model into $\mathrm{WM}_{\mathrm{Evo}}$, aligning the simulator with the evolving policy distribution. To balance alignment quality and computational efficiency, we perform only a single co-evolution step in practice, i.e., one refinement from $\mathrm{WM}_{\mathrm{Base}}$ to $\mathrm{WM}_{\mathrm{Evo}}$ rather than multiple iterative alternations. To ensure a fair comparison, all methods receive the same real-environment rollout budget of 2,500 trajectories per suite. For world-model-based methods, including WMPO and WoVR, these trajectories are used exclusively for world model training and refinement; policy optimization is conducted entirely within the learned world model via imagined rollouts, without further interaction with the ground-truth simulator.
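The per-suite data budget described above can be summarized with a small sketch. This is our own illustration of the stated trajectory counts, not the authors' code; the function name and dictionary keys are ours.

```python
# Illustrative accounting of the per-suite real-environment rollout budget:
# 1,500 base-policy trajectories train WM_Base, and 1,000 updated-policy
# trajectories refine it into WM_Evo, for a shared total of 2,500.

def allocate_rollout_budget(num_tasks: int = 10,
                            base_per_task: int = 150,
                            refine_total: int = 1000) -> dict:
    """Return the trajectory counts used at each world-model training stage."""
    base_total = num_tasks * base_per_task          # rollouts for WM_Base
    return {
        "wm_base": base_total,                      # base-policy rollouts
        "wm_evo_refinement": refine_total,          # updated-policy rollouts
        "total": base_total + refine_total,         # shared real-data budget
    }

budget = allocate_rollout_budget()
```

Under the paper's setting this yields 1,500 + 1,000 = 2,500 trajectories per suite, the same real-interaction budget given to the online baseline.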
In contrast, GRPO directly interacts with the ground-truth simulator and consumes the same 2,500-trajectory budget for on-policy optimization.

Quantitative Results. Table 2 summarizes success rates across LIBERO suites under this shared simulator-trajectory budget.

Table 2: Task success rates (%) across LIBERO task suites. The base policy is OpenVLA-OFT trained with one-trajectory supervised fine-tuning. All methods use 2,500 trajectories collected from the ground-truth simulator: GRPO (online) consumes them for on-policy interaction, whereas WMPO and WoVR use them only for world model training and perform policy optimization via imagined rollouts. Improvements are shown in parentheses relative to the base policy.

Method                | Spatial      | Object       | Goal         | Long         | Avg ↑
OpenVLA-OFT-base [17] | 61.5         | 36.3         | 48.2         | 13.7         | 39.9
GRPO (online) [9]     | 66.6         | 45.1         | 52.1         | 14.5         | 44.6
WMPO [57]             | 67.8         | 48.0         | 54.6         | 13.7         | 46.2
WoVR (Ours)           | 81.5 (+20.0) | 82.0 (+45.7) | 77.5 (+29.3) | 35.8 (+22.1) | 69.2 (+29.3)

The base policy achieves moderate performance, reflecting the limitations of imitation learning under sparse rewards and limited demonstrations. While GRPO improves over the base policy, its gains come at a high interaction cost: in practice, each policy update requires close to a thousand additional simulator trajectories, making the optimization sample-inefficient under realistic interaction constraints. This highlights a fundamental limitation of purely online reinforcement learning in data-scarce robotic settings. WMPO further improves performance on the short- and medium-horizon suites (Spatial, Object, and Goal), demonstrating that world-model-based optimization can provide benefits beyond online interaction.
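For reference, GRPO-style methods compute group-relative advantages by standardizing each rollout's reward against the other rollouts of the same task group. A minimal sketch of that advantage computation under sparse success/failure rewards (our own illustration, not code from the paper or any baseline):

```python
# GRPO-style group-relative advantage: each trajectory's reward is
# standardized against the mean and standard deviation of its rollout group.
# Illustrative sketch only; function name and eps default are ours.

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Sparse task rewards: success = 1.0, failure = 0.0
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Note that when every rollout in a group has the same reward (all successes or all failures), the group provides no learning signal, which is one reason sparse-reward online RL consumes so many simulator trajectories.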
However, WMPO does not achieve performance gains on the LIBERO-Long suite, which consists of longer-horizon tasks. In these tasks, rollout instability in later stages of autoregressive generation degrades policy optimization, resulting in no improvement over the base policy. In contrast, WoVR consistently achieves the highest success rates across all evaluated suites. Notably, WoVR improves over the base policy by +20.0 points on Spatial, +45.7 on Object, +29.3 on Goal, and +22.1 on the long-horizon LIBERO-Long suite. On average, WoVR achieves a success rate of 69.2%, substantially outperforming both GRPO (44.6%) and WMPO (46.2%). These results indicate that the improved stability and controllability of the proposed world model translate directly into more effective policy optimization. In particular, the strong performance on long-horizon tasks highlights that suppressing error accumulation in imagined rollouts is critical for reliable reinforcement learning with learned simulators.

5.3 Q3: Do Policies Optimized with WoVR Reliably Transfer to the Real World?

Finally, we evaluate whether policies optimized with WoVR transfer reliably to real-world robotic manipulation tasks.

Experimental Setup. Our experiments are conducted on a Franka Emika Panda robot. We consider two contact-rich manipulation tasks: (i) Pick Banana, which requires picking up a banana and placing it onto a plate, and (ii) Pick Bread, which requires picking up a bread item and placing it onto a designated bread marker. For each task, we collect 10 teleoperated demonstrations to pre-train the base VLA policy, and additionally collect 150 rollouts from the base policy to train the world model. After training, we deploy the resulting policies on the physical robot and evaluate success rates over 30 independent trials per task.

Figure 6: Real-world setup on a Franka Panda for Pick Banana and Pick Bread.

Quantitative Results. Table 3 reports the real-world success rates.
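Since the real-world rates are simple trial fractions over 30 trials, they can be reproduced directly from the counts. The sketch below also computes a 95% Wilson score interval for each proportion; the interval is our own addition for context and is not reported in the paper.

```python
import math

def success_rate(successes: int, trials: int) -> float:
    """Success rate in percent from raw trial counts."""
    return 100.0 * successes / trials

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """95% Wilson score interval for a binomial proportion (our addition)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half, centre + half

rate = success_rate(28, 30)        # WoVR on Pick Banana
lo, hi = wilson_interval(28, 30)   # uncertainty from only 30 trials
```

With 30 trials per task the intervals are fairly wide, which is worth keeping in mind when comparing methods whose rates differ by only a few points.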
On Pick Banana, the base policy achieves a success rate of 46.7% (14/30), while WoVR improves it to 93.3% (28/30). On Pick Bread, WoVR increases the success rate from 76.7% (23/30) to 90.0% (27/30). These results demonstrate that WoVR delivers consistent real-world gains over imitation learning without requiring additional online interaction during policy optimization, indicating reliable transfer of the optimized behaviors to the physical robot.

Table 3: Real-world success rates (%, 30 trials per task) on a Franka Panda robot. Improvements are shown in parentheses relative to the base policy.

Method           | Pick Banana          | Pick Bread           | Avg
OpenVLA-OFT-base | 46.7 (14/30)         | 76.7 (23/30)         | 61.7
WoVR (Ours)      | 93.3 (28/30) (+46.6) | 90.0 (27/30) (+13.3) | 91.7 (+30.0)

6 Ablation Study

6.1 Ablation on World Model Mechanisms

We first conduct ablation studies on the core design choices of the proposed world model, aiming to understand how different context modeling mechanisms affect the stability of long-horizon video generation. Specifically, we investigate: (i) the number of memory frames used as visual context, (ii) the use of a fixed reference frame, and (iii) the effect of adding noise to context frames during training.

Experimental Variants. We compare the full WoVR model against three ablated variants:
• WoVR w/o ref, which removes the fixed reference frame from the context window;
• WoVR w. mem=1, which uses only a single-frame context;
• WoVR w/o noisy context, which disables noise injection on context frames during training.

All variants are trained and evaluated on the LIBERO-Spatial suite only. We train the world model on 1,500 VLA rollout trajectories and evaluate on a held-out set of 24 trajectories.

Quantitative Results. Table 4 reports quantitative results measured by LPIPS, FID, FVD, and FloLPIPS under different rollout horizons.
Compared to using a single-frame context, employing a multi-frame context with a fixed reference anchor significantly improves performance across all metrics.

Table 4: Ablation study on world model mechanisms (LIBERO-Spatial). Rollout denotes the rollout horizon length.

Method                 | Rollout | LPIPS ↓ | FID ↓  | FVD ↓   | FloLPIPS ↓
WoVR (Ours)            | 512     | 0.091   | 36.687 | 73.493  | 0.154
                       | 256     | 0.069   | 27.238 | 63.948  | 0.110
                       | 128     | 0.051   | 20.780 | 49.017  | 0.081
WoVR w/o ref           | 512     | 0.133   | 73.942 | 123.502 | 0.168
                       | 256     | 0.089   | 49.406 | 86.000  | 0.116
                       | 128     | 0.064   | 35.559 | 86.146  | 0.090
WoVR w. mem=1          | 512     | 0.120   | 64.501 | 86.042  | 0.165
                       | 256     | 0.086   | 46.790 | 81.742  | 0.117
                       | 128     | 0.065   | 36.047 | 79.605  | 0.095
WoVR w/o noisy context | 512     | 0.099   | 44.712 | 77.284  | 0.160
                       | 256     | 0.074   | 31.691 | 61.660  | 0.115
                       | 128     | 0.054   | 23.444 | 58.836  | 0.085

Figure 7: Qualitative ablation results on LIBERO-Spatial. Ablated variants exhibit error accumulation and visual drift under long-horizon rollouts, while the full WoVR model remains stable and consistent with the ground truth.

To better understand the failure modes behind these quantitative trends, we provide qualitative comparisons in Fig. 7. As shown in the figure, models without a fixed reference frame or noisy context exhibit noticeable spatial drift and object disappearance over long-horizon rollouts, whereas the full WoVR model remains visually stable and consistent with the ground truth. Removing the reference frame leads to a clear degradation in performance, especially under longer rollout horizons.
This result suggests that anchoring the context with a fixed reference frame effectively suppresses error accumulation in the autoregressive feedback loop, which is critical for maintaining stability in long-horizon video generation. Furthermore, disabling noise injection on context frames also leads to noticeable performance drops. While the degradation is moderate for short rollouts, the gap widens as the rollout length increases. This indicates that adding mild noise to context frames improves robustness in long-horizon generation by reducing over-reliance on precise conditioning inputs, thereby alleviating the train–inference gap. Overall, these results demonstrate that the proposed context modeling strategy (a fixed reference frame, a multi-frame memory window, and noisy context augmentation) plays a crucial role in stabilizing long-horizon video generation. Together, these mechanisms enable WoVR to maintain high fidelity and temporal consistency under closed-loop autoregressive inference, providing a more reliable simulator for downstream reinforcement learning.

6.2 Ablation on Policy Optimization Mechanisms

We next ablate key components of the policy optimization pipeline in WoVR, aiming to understand how different design choices affect downstream VLA task performance. In particular, we focus on mechanisms that facilitate stable policy learning and effective use of the learned world model.

Experimental Setup. All experiments are conducted on the LIBERO-Spatial suite, with the same training protocol, data budget, and evaluation procedure as described in Sec. 5.2. Specifically, the base VLA policy is pre-trained following the same demonstration setup as in Q2, and policy optimization is performed using world-model-based reinforcement learning. Task performance is measured by the average success rate over the LIBERO-Spatial tasks.
We compare the full WoVR framework against the following ablated variants:
• WoVR w/o KIR, which removes keyframe-based initialization and starts policy optimization from randomly sampled initial states in the world model;
• WoVR w/o PACE, which disables the co-evolution of the world model with the updated policy and keeps the world model fixed during policy optimization.

Quantitative Results. Table 5 reports the success rates on the LIBERO-Spatial suite. The full WoVR framework achieves the highest performance, with an average success rate of 81.5%. Removing keyframe-based initialization leads to a noticeable drop, reducing the success rate to 78.2%. This indicates that KIR plays an important role in stabilizing early-stage policy learning by providing meaningful initial states. Disabling the co-evolution of the world model further degrades performance to 71.0%, suggesting that continuously refining the world model with updated policy rollouts is critical for maintaining simulator accuracy and preventing compounding model errors during policy optimization.

Table 5: Ablation on policy optimization mechanisms on LIBERO-Spatial. Success rate (%) is averaged over all tasks in the suite.

Method        | Success Rate ↑
WoVR (Ours)   | 81.5
WoVR w/o KIR  | 78.2
WoVR w/o PACE | 71.0

7 Conclusion

In this work, we revisited world-model-based reinforcement learning for VLA policies through the lens of reliability. Rather than assuming a learned world model to be a faithful simulator, we identified hallucination under closed-loop imagined interaction as the central obstacle: autoregressive error accumulation and policy-induced distribution shift can systematically corrupt optimization signals, causing reinforcement learning to exploit model inaccuracies instead of genuine task progress. To make RL in imagination viable under imperfect dynamics, we introduced WoVR, a hallucination-aware framework that controls hallucination at three interconnected levels.
First, we strengthen the simulator itself by building a rollout-stable, action-controllable video world model, improving long-horizon consistency under policy-driven generation. Second, because residual prediction errors are unavoidable, we reshape the interaction protocol with Keyframe-Initialized Rollouts (KIR) to reduce the effective error depth and concentrate learning on task-critical segments where the dynamics must be correct. Third, to prevent the evolving policy from drifting out of the simulator's training distribution, we maintain policy–simulator alignment via PACE, a policy-aligned co-evolution strategy that mitigates distribution mismatch without requiring continuous online supervision. Extensive experiments on LIBERO and real-world manipulation tasks demonstrate that WoVR enables stable long-horizon imagined rollouts and effective on-policy optimization, yielding substantial gains over imitation learning and reliable transfer to physical robots. Overall, our results suggest that learned world models can serve as practical simulators for reinforcement learning when hallucination is explicitly regulated by design, interaction, and alignment. Nevertheless, WoVR reduces but does not fully eliminate hallucination, particularly in extremely long-horizon or highly contact-sensitive settings, and it still relies on learned reward modeling and limited real-data refinement, leaving broader reliability guarantees as an open direction for future work.

References

[1] A. Bagchi, Z. Bao, H. Bharadhwaj, Y. Wang, P. Tokmakov, and M. Hebert (2026) Walk through paintings: egocentric world models from internet priors. arXiv preprint arXiv:2601.15284. Cited by: §2.2.
[2] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. (2024) $\pi_0$: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164. Cited by: §1.
[3] B. Chen, D. M. Monso, Y. Du, M. Simchowitz, R. Tedrake, and V.
Sitzmann (2024) Diffusion forcing: next-token prediction meets full-sequence diffusion. arXiv preprint arXiv:2407.01392. Cited by: §4.1.
[4] K. Chen, Z. Liu, T. Zhang, Z. Guo, S. Xu, H. Lin, H. Zang, X. Li, Q. Zhang, Z. Yu, G. Fan, T. Huang, Y. Wang, and C. Yu (2026) $\pi_{\texttt{RL}}$: Online RL fine-tuning for flow-based vision-language-action models. arXiv preprint arXiv:2510.25889.