AgiBot, Fudan, Shanghai Innovation | May 31, 2026 | arXiv:2606.01027

$\tau_0$-WM: A Unified Video-Action World Model for Robotic Manipulation

Pengfei Zhou, Shengcong Chen, Di Chen, Jiaxu Wang, Rongjun Jin, Bingwen Zhu, Bin Zhu, Yike Pan, Songen Gu, Kuanning Wang, Shufeng Nan, S. Nan, Xingyu Qiu, Chenhao Qiu, Pu Yang, Yunuo Cai, Jianxiong Gao, Yifan Li, Yanwei Fu, Xiangyu Yue, Zhi Chen, Jianlan Luo

AI Summary

The $\tau_0$-World Model ($\tau_0$-WM) integrates policy learning, video prediction, and action evaluation into a single framework for robotic manipulation, so that it can generate executable actions while anticipating their future consequences. Using a shared video diffusion backbone, the model predicts future visual latents and continuous action chunks from multi-view observations, language instructions, and robot state. It is trained on approximately $27{,}300$ hours of data spanning real-robot teleoperation, UMI-style interaction, egocentric human videos, and rollout or failure trajectories. The authors report that $\tau_0$-WM outperforms relevant baselines on long-horizon and fine-grained manipulation tasks.

Key Contribution

$\tau_0$-WM unifies policy learning, video prediction, and action evaluation on one shared video diffusion backbone, and the authors report better performance than relevant baselines on long-horizon and fine-grained robotic manipulation tasks.

Abstract

Robotic manipulation requires models that generate executable actions while anticipating and evaluating their future consequences before physical execution. We present $\tau_0$-World Model ($\tau_0$-WM), a unified video-action world model that integrates policy learning, video prediction, and action evaluation within a single future-predictive framework. Built on a shared video diffusion backbone, $\tau_0$-WM provides two complementary interfaces. First, a video action model jointly predicts future visual latents and continuous action chunks from multi-view observations, language instructions, and robot state. Second, an action-conditioned video simulator rolls out candidate action chunks into multi-view futures and predicts dense task-progress scores. The model is trained on approximately $27{,}300$ hours of real-robot teleoperation, UMI-style interaction, egocentric human videos, and rollout or failure trajectories using modality-specific supervision masks. At inference time, $\tau_0$-WM uses test-time computation to sample action candidates, rank them with re-denoising consistency, and invoke simulator-based rectification for low-quality candidates. On challenging long-horizon and fine-grained robotic manipulation tasks, $\tau_0$-WM shows superior performance over other relevant baselines.
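The test-time procedure described in the abstract (sample action candidates, rank them, rectify low-quality ones) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the threshold rule, and the toy scoring are hypothetical, and the abstract does not specify how re-denoising consistency is computed or when rectification is triggered. The sketch simply scores each candidate by how little it changes under a stand-in "re-denoise" step and falls back to a rectifier when even the best score is below a threshold.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical representation: a flattened chunk of continuous actions.
ActionChunk = List[float]


def select_action_chunk(
    sample_fn: Callable[[], ActionChunk],
    consistency_fn: Callable[[ActionChunk], float],
    rectify_fn: Callable[[ActionChunk], ActionChunk],
    num_candidates: int = 8,
    threshold: float = -0.05,
) -> Tuple[ActionChunk, float, bool]:
    """Sample candidates, keep the highest-scoring one, and rectify it
    if its score is below `threshold` (an assumed trigger rule).

    Returns (chunk, best_score, was_rectified).
    """
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    candidates = [sample_fn() for _ in range(num_candidates)]
    scored = [(consistency_fn(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    if best_score < threshold:
        return rectify_fn(best), best_score, True
    return best, best_score, False


def toy_redenoise(chunk: ActionChunk) -> ActionChunk:
    """Toy stand-in for a re-denoising pass: shrinks actions toward zero."""
    return [0.5 * a for a in chunk]


def toy_consistency(chunk: ActionChunk) -> float:
    """Toy score: negative mean squared change under toy_redenoise.
    Higher (closer to zero) means the candidate is more self-consistent."""
    redenoised = toy_redenoise(chunk)
    return -sum((a - b) ** 2 for a, b in zip(chunk, redenoised)) / len(chunk)


if __name__ == "__main__":
    rng = random.Random(0)
    chunk, score, rectified = select_action_chunk(
        sample_fn=lambda: [rng.gauss(0.0, 1.0) for _ in range(4)],
        consistency_fn=toy_consistency,
        rectify_fn=toy_redenoise,  # stand-in for simulator-based rectification
    )
    print(chunk, score, rectified)
```

In a real system, `sample_fn` would presumably wrap the video action model, `consistency_fn` the re-denoising check, and `rectify_fn` the action-conditioned simulator; the callable interface above only shows how the three stages might compose.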

Robotics & Embodied AI | World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 47

Year: 2026

Venue: N/A
