This paper introduces Mem-World, a memory-augmented, multi-view, action-conditioned world model for persistent world modeling in robot manipulation, where end-effector occlusions and rapid wrist-camera motion make the current observation insufficient for predicting future views. Its core component, W-VMem, is a wrist-view-centered, surfel-indexed memory that anchors historical observations to temporally evolving surface elements, so that relevant past frames can be retrieved conditioned on future actions. In experiments, Mem-World generates persistent rollouts in complex manipulation scenarios, improves the Pearson correlation between model-based policy evaluation and real-world performance by 14.5% over Ctrl-World, and raises success rates on long-horizon tasks from 58% to 72% through synthetic data generation.
Mem-World retrieves geometrically relevant history frames to keep world-model rollouts persistent, and policies improved with its synthetic data raise success rates on long-horizon tasks from 58% to 72%.
Action-conditioned world models have emerged as a promising paradigm for robot learning, offering a scalable alternative to costly real-world experimentation by generating action-consistent video rollouts. However, persistent world modeling remains challenging in manipulation: frequent end-effector occlusions and rapid wrist-camera motion make the current observation insufficient for predicting future views, causing models to forget or hallucinate scene details seen in earlier frames. Existing memory retrieval strategies often fail to identify informative history in dynamic manipulation scenarios. To address this limitation, we propose Mem-World, a memory-augmented multi-view action-conditioned world model. At its core, we present W-VMem, a 4D wrist-view-centered surfel-indexed memory that anchors historical observations to temporally evolving surface elements. By explicitly modeling when and where scene elements are observed, W-VMem enables geometry-aware retrieval of relevant history frames conditioned on future actions. During generation, relevant history frames are selected via surfel-based rendering and scoring, providing informative and non-redundant context for prediction. Extensive experiments show that Mem-World generates persistent rollouts in complex manipulation scenarios, enables more reliable policy evaluation than Ctrl-World, improving the Pearson correlation with real-world performance by 14.5%, and supports effective policy improvement through synthetic data generation, increasing success rates from 58% to 72% on long-horizon tasks.
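The abstract does not spell out how surfel-based rendering and scoring work, so the following is only a minimal illustrative sketch of the general idea, not the authors' W-VMem implementation. It assumes each surfel records which history frames observed it, approximates "rendering" with a simple view-cone and facing test from a future camera pose, and approximates "scoring" with a greedy coverage rule that skips frames adding nothing new. All names (`Surfel`, `visible_surfels`, `select_history_frames`) and parameters (`fov_deg`, `k`) are hypothetical.

```python
import math
from dataclasses import dataclass, field


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _norm(a):
    return math.sqrt(_dot(a, a))


@dataclass
class Surfel:
    """Hypothetical surface element: a position, a normal, and the
    indices of the history frames in which it was observed."""
    position: tuple
    normal: tuple
    frames: set = field(default_factory=set)


def visible_surfels(surfels, eye, forward, fov_deg=60.0):
    """Crude stand-in for surfel rendering (an assumption, not the paper's
    renderer): return indices of surfels that lie inside a view cone and
    whose normal faces the camera. Occlusion is ignored."""
    cos_half = math.cos(math.radians(fov_deg) / 2.0)
    fwd_len = _norm(forward)
    visible = []
    for i, s in enumerate(surfels):
        ray = _sub(s.position, eye)
        dist = _norm(ray)
        if dist == 0.0:
            continue
        in_cone = _dot(ray, forward) / (dist * fwd_len) >= cos_half
        facing = _dot(s.normal, ray) < 0.0
        if in_cone and facing:
            visible.append(i)
    return visible


def select_history_frames(surfels, future_views, k=2):
    """Pick up to k history frames by greedy coverage: each step takes the
    frame that observed the most still-uncovered surfels visible from the
    future views. Ties go to the lowest frame index."""
    wanted = set()
    for eye, forward in future_views:
        wanted.update(visible_surfels(surfels, eye, forward))

    # Map each candidate frame to the wanted surfels it observed.
    coverage = {}
    for i in wanted:
        for f in surfels[i].frames:
            coverage.setdefault(f, set()).add(i)

    chosen, uncovered = [], set(wanted)
    while uncovered and coverage and len(chosen) < k:
        best = max(sorted(coverage), key=lambda f: len(coverage[f] & uncovered))
        gain = coverage[best] & uncovered
        if not gain:
            break  # remaining frames are redundant
        chosen.append(best)
        uncovered -= gain
        del coverage[best]
    return chosen
```

In this sketch the future camera poses would be derived from the planned actions (for a wrist camera, from the end-effector trajectory); here they are passed in directly as `(eye, forward)` pairs. The greedy rule is one simple way to obtain context that is both informative and non-redundant: a frame that only re-observes already covered surfels is never selected.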