SUSTech | Apr 21, 2026 | arXiv:2604.19683

Mask World Model: Predicting What Matters for Robust Robot Policy Learning

Yunfan Lou, Xiaowei Chi, Xiaojie Zhang, Zezhong Qian, Chengxuan Li, Rongyu Zhang, Yaoxu Lyu, Guoyu Song, Chuyao Fu, Haoxuan Xu, Pengwei Wang, Shanghang Zhang

AI Summary

The paper introduces Mask World Model (MWM), a video diffusion model that predicts the evolution of semantic masks instead of raw RGB pixels for robot policy learning. By predicting masks, MWM imposes a geometric information bottleneck, forcing the model to focus on essential physical dynamics and filter out irrelevant visual noise. Experiments on LIBERO, RLBench, and real-world settings show MWM outperforms RGB-based world models in generalization and robustness, even with random token pruning.
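
The bottleneck idea in the summary can be illustrated with a toy sketch. This is not the paper's pipeline: the grid scene, the label set, and the functions `render_frame` and `semantic_mask` are hypothetical, chosen only to show that a lighting change alters every pixel of a rendered frame while leaving the semantic mask untouched.

```python
# Toy illustration only; not the paper's pipeline. All names are hypothetical.

BASE_INTENSITY = {0: 40, 1: 120, 2: 200}  # assumed labels: background, robot arm, target object


def render_frame(layout, illumination_pct):
    """Render a fake grayscale frame: per-object intensity scaled by lighting."""
    return [[BASE_INTENSITY[obj] * illumination_pct // 100 for obj in row] for row in layout]


def semantic_mask(layout, illumination_pct):
    """Return per-cell object labels; lighting is accepted but deliberately ignored."""
    return [list(row) for row in layout]


layout = [
    [0, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 2, 2],
]

frame_bright = render_frame(layout, illumination_pct=120)
frame_dim = render_frame(layout, illumination_pct=60)
mask_bright = semantic_mask(layout, illumination_pct=120)
mask_dim = semantic_mask(layout, illumination_pct=60)

# Count the cells whose rendered intensity differs between the two lighting conditions.
changed_pixels = sum(
    b != d for row_b, row_d in zip(frame_bright, frame_dim) for b, d in zip(row_b, row_d)
)
print("pixels changed by lighting:", changed_pixels, "of", len(layout) * len(layout[0]))
print("mask unchanged by lighting:", mask_bright == mask_dim)
```

A model trained to predict the next frame must account for the lighting-dependent intensities, whereas a model trained to predict the next mask only has to track where each object is.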

Key Contribution

Semantic masks are all you need: predicting mask dynamics in world models yields surprisingly robust and generalizable robot policies compared to predicting raw pixels.

Abstract

World models derived from large-scale video generative pre-training have emerged as a promising paradigm for generalist robot policy learning. However, standard approaches often focus on high-fidelity RGB video prediction, which can result in overfitting to irrelevant factors such as dynamic backgrounds and illumination changes. These distractions reduce the model's ability to generalize, ultimately leading to unreliable and fragile control policies. To address this, we introduce the Mask World Model (MWM), which leverages video diffusion architectures to predict the evolution of semantic masks instead of pixels. This shift imposes a geometric information bottleneck, forcing the model to capture essential physical dynamics and contact relations while filtering out visual noise. We integrate this mask dynamics backbone with a diffusion-based policy head to enable robust end-to-end control. Extensive evaluations on the LIBERO and RLBench simulation benchmarks show that MWM significantly outperforms state-of-the-art RGB-based world models. Furthermore, real-world experiments and a robustness evaluation (via random token pruning) show that MWM exhibits superior generalization and resilience to the loss of texture information.
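
The abstract does not spell out the random token pruning protocol, so the following is only a minimal sketch of the general idea under stated assumptions: visual tokens are held in a flat list, a fixed fraction is kept uniformly at random, and the surviving tokens keep their original order. The function name `prune_tokens` and its parameters `keep_ratio` and `seed` are hypothetical.

```python
import random


def prune_tokens(tokens, keep_ratio, seed=0):
    """Keep a random subset of tokens (assumed protocol, not the paper's).

    Returns the surviving tokens in their original order, plus their indices.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    rng = random.Random(seed)  # local generator so results are reproducible
    n_keep = min(len(tokens), max(1, round(len(tokens) * keep_ratio)))
    kept_indices = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in kept_indices], kept_indices


# Example: 16 placeholder patch tokens, half of them dropped.
patch_tokens = [f"patch_{i}" for i in range(16)]
kept_tokens, kept_indices = prune_tokens(patch_tokens, keep_ratio=0.5, seed=42)
print(len(kept_tokens), "tokens kept at indices", kept_indices)
```

Sweeping `keep_ratio` downward and re-measuring task success is one way such a pruning test could be run; the ratios and metrics used in the paper are not stated here.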

Computer Vision, Robotics & Embodied AI, World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 33

Year: 2026

Venue: N/A
