Jun 17, 2026 · arXiv:2606.18960

Mem-World: Memory-Augmented Action-Conditioned World Models for Persistent Robot Manipulation

Zirui Zheng, Jiaqian Yu, Xiongfeng Peng, Jun Shi, Mingyi Li, Chao Zhang, Weiming Li, Dong Wang, Huchuan Lu, Xu Jia

AI Summary

This paper introduces Mem-World, a memory-augmented multi-view action-conditioned world model for persistent robot manipulation, where end-effector occlusions and rapid wrist-camera motion make the current observation insufficient for predicting future views. Its core component, W-VMem, anchors historical observations to temporally evolving surface elements (surfels), so that relevant past frames can be retrieved based on future actions. Experiments show that Mem-World generates persistent rollouts in complex manipulation scenarios, improves the Pearson correlation with real-world performance by 14.5% over Ctrl-World in policy evaluation, and raises success rates on long-horizon tasks from 58% to 72% through synthetic data generation.

Key Contribution

Mem-World retrieves relevant historical context through a surfel-indexed memory conditioned on future actions; synthetic data generated with it raises policy success rates on long-horizon manipulation tasks from 58% to 72%.

Abstract

Action-conditioned world models have emerged as a promising paradigm for robot learning, offering a scalable alternative to costly real-world experimentation by generating action-consistent video rollouts. However, persistent world modeling remains challenging in manipulation: frequent end-effector occlusions and rapid wrist-camera motion make the current observation insufficient for predicting future views, causing models to forget or hallucinate scene details seen in earlier frames. Existing memory retrieval strategies often fail to identify informative history in dynamic manipulation scenarios. To address this limitation, we propose Mem-World, a memory-augmented multi-view action-conditioned world model. At its core, we present W-VMem, a 4D wrist-view-centered surfel-indexed memory that anchors historical observations to temporally evolving surface elements. By explicitly modeling when and where scene elements are observed, W-VMem enables geometry-aware retrieval of relevant history frames conditioned on future actions. During generation, relevant history frames are selected via surfel-based rendering and scoring, providing informative and non-redundant context for prediction. Extensive experiments show that Mem-World generates persistent rollouts in complex manipulation scenarios, enables more reliable policy evaluation than Ctrl-World, improving the Pearson correlation with real-world performance by 14.5%, and supports effective policy improvement through synthetic data generation, increasing success rates from 58% to 72% on long-horizon tasks.
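The retrieval idea in the abstract (score history frames by the surfels they observed that would be visible from a future view, then keep a non-redundant subset) can be illustrated with a small sketch. This is not the authors' implementation: the `Surfel` class, the function names, the view-cone visibility test standing in for surfel-based rendering, and the greedy coverage rule are all assumptions made for illustration, and the future camera pose is assumed to be already derived from the future action.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Surfel:
    """Hypothetical surfel record: a surface patch plus the frames that saw it."""
    position: tuple                              # (x, y, z) in world coordinates
    normal: tuple                                # unit surface normal
    seen_in: set = field(default_factory=set)    # indices of observing history frames


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def visible_surfels(surfels, cam_pos, cam_forward, half_fov_deg=45.0, max_range=2.0):
    """Indices of surfels inside a simple view cone and facing the camera.

    Assumes cam_forward is a unit vector. A real system would rasterize
    surfels with occlusion handling; this cone test is a stand-in.
    """
    cos_limit = math.cos(math.radians(half_fov_deg))
    visible = []
    for i, s in enumerate(surfels):
        to_surfel = _sub(s.position, cam_pos)
        dist = math.sqrt(_dot(to_surfel, to_surfel))
        if dist == 0.0 or dist > max_range:
            continue
        in_cone = _dot(to_surfel, cam_forward) / dist >= cos_limit
        faces_camera = _dot(s.normal, to_surfel) < 0.0
        if in_cone and faces_camera:
            visible.append(i)
    return visible


def select_history_frames(surfels, cam_pos, cam_forward, k=2):
    """Greedily pick up to k frames covering the most not-yet-covered visible surfels."""
    uncovered = set(visible_surfels(surfels, cam_pos, cam_forward))
    candidates = set()
    for i in uncovered:
        candidates |= surfels[i].seen_in
    chosen = []
    while uncovered and candidates and len(chosen) < k:
        def gain(frame):
            # Number of still-uncovered visible surfels this frame observed.
            return sum(1 for i in uncovered if frame in surfels[i].seen_in)
        best = max(sorted(candidates), key=gain)  # ties go to the earliest frame
        if gain(best) == 0:
            break  # remaining frames add no new coverage (redundant)
        chosen.append(best)
        candidates.discard(best)
        uncovered = {i for i in uncovered if best not in surfels[i].seen_in}
    return chosen


# Toy scene: three surfels in front of the future camera, one behind it.
surfels = [
    Surfel((0.0, 0.0, 1.0), (0.0, 0.0, -1.0), {0, 1}),
    Surfel((0.1, 0.0, 1.0), (0.0, 0.0, -1.0), {1}),
    Surfel((-0.1, 0.0, 1.0), (0.0, 0.0, -1.0), {2}),
    Surfel((0.0, 0.0, -1.0), (0.0, 0.0, 1.0), {3}),
]
selected = select_history_frames(surfels, (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), k=2)
print(selected)
```

Scoring by marginal coverage rather than raw overlap is one simple way to obtain context that is both informative and non-redundant: a frame that only re-observes surfels already covered by a chosen frame contributes nothing and is skipped.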

Robotics & Embodied AI, World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
