StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement

Junwon Seo, Sushant Veer, Ran Tian, Apoorva Sharma, Karen Leung, Edward Schmerling, Marco Pavone, Andrea Bajcsy

AI Summary

This paper introduces StressDream, a method that makes video world models (WMs) more useful for policy evaluation and improvement by steering their imaginations toward high-impact outcomes specified in text at inference time. It does so by optimizing the initial noise of diffusion-based WMs under two objectives: a semantic objective, computed with a Vision-Language Model, that pushes the generated video toward the target event, and a plausibility objective that keeps the optimized noise in distribution. With video world models for autonomous driving and robotic manipulation, the steered imaginations surface plausible but rarely sampled outcomes such as task failures, which identifies actions whose plausible futures include undesirable outcomes and supports more robust policy evaluation and improvement.

Key Contribution

Steering the imaginations of a video world model toward high-impact yet plausible outcomes can reveal failure modes of robot actions that nominal imaginations miss unless prohibitively many samples are drawn.

Abstract

Video world models (WMs) have shown promise for policy evaluation and improvement by imagining realistic future observations conditioned on ego-robot actions. While WMs can model distributions over futures, policy evaluation and improvement typically rely on nominal imaginations, which can miss high-impact outcomes of robot actions unless prohibitively many samples are drawn. To enable robust policy evaluation and improvement over WM imaginations, we propose StressDream, which steers imaginations toward high-impact yet plausible outcomes specified at inference time by optimizing the initial noise of diffusion-based WMs. However, optimizing high-dimensional noise is challenging: the optimization must reason about nuanced, scene-dependent target events in generated videos while avoiding out-of-distribution (OOD) noise that yields implausible imaginations. We address this with two complementary objectives: a semantic objective with a Vision-Language Model that provides informative gradients by reasoning about the generated video, and a plausibility objective that prevents the optimized noise from drifting OOD. With state-of-the-art video world models for autonomous driving and robotic manipulation, we show that StressDream effectively steers imaginations toward high-impact yet plausible outcomes specified by text at inference time, such as task failures, enabling robust policy evaluation and improvement by identifying actions whose plausible futures include undesirable outcomes. Video results are available at https://junwon.me/StressDream/.
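The interplay between the two objectives can be illustrated with a toy example. The sketch below is not the authors' implementation: the world model and the Vision-Language Model are replaced by a hypothetical sigmoid score along a fixed direction (`EVENT_DIR`), and the plausibility term is assumed to be a penalty that keeps the squared norm of the noise close to its expected value under a standard Gaussian. All names, the penalty form, and the hyperparameters are assumptions chosen for illustration.

```python
import math

DIM = 4
# Hypothetical unit-norm direction in noise space that pushes the toy
# "imagination" toward the target event (stand-in for WM + VLM).
EVENT_DIR = [0.5, -0.5, 0.5, -0.5]


def semantic_score(noise):
    """Toy semantic objective in (0, 1): sigmoid of the projection onto EVENT_DIR."""
    logit = sum(w * z for w, z in zip(EVENT_DIR, noise))
    return 1.0 / (1.0 + math.exp(-logit))


def norm_ratio(noise):
    """Squared norm divided by DIM; close to 1 for typical standard-Gaussian noise."""
    return sum(z * z for z in noise) / DIM


def steer_noise(noise, lam, steps=500, lr=0.2):
    """Gradient ascent on semantic_score(z) - lam * (norm_ratio(z) - 1)^2."""
    z = list(noise)
    for _ in range(steps):
        s = semantic_score(z)
        r = norm_ratio(z) - 1.0
        # Analytic gradient: the sigmoid term pulls along EVENT_DIR,
        # the (assumed) plausibility penalty pulls the norm back toward sqrt(DIM).
        grad = [s * (1.0 - s) * w - lam * 4.0 * r * v / DIM
                for w, v in zip(EVENT_DIR, z)]
        z = [v + lr * g for v, g in zip(z, grad)]
    return z


# Initial noise with a typical norm and a neutral semantic score of 0.5.
init = [1.0, 1.0, -1.0, -1.0]
steered = steer_noise(init, lam=1.0)        # with the plausibility penalty
unconstrained = steer_noise(init, lam=0.0)  # semantic objective only

for name, z in [("init", init), ("steered", steered), ("unconstrained", unconstrained)]:
    print(name, round(semantic_score(z), 3), round(norm_ratio(z), 3))
```

In this toy setting, dropping the penalty (`lam=0.0`) yields a higher semantic score but lets the noise norm drift far from its typical value, which is the kind of OOD drift the plausibility objective is meant to prevent. With the penalty, the score still rises well above its initial value while the norm stays close to typical.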

Robotics & Embodied AI, World Models & Planning


Year: 2026

Venue: N/A
