The paper introduces Mosaic Memory (MosaicMem), a hybrid spatial memory system for video diffusion models that combines explicit 3D patch-based representations with implicit generative modeling. MosaicMem lifts image patches into 3D space for accurate localization and retrieval, while leveraging the diffusion model's native conditioning to model dynamic elements and maintain prompt adherence. Experiments demonstrate that MosaicMem achieves better pose adherence than purely implicit memory and stronger dynamic modeling than purely explicit memory, enabling long-horizon video generation and scene manipulation.
Achieve minute-level navigable video world models by combining the strengths of explicit 3D patch memory with implicit generative modeling.
Video diffusion models are moving beyond short, plausible clips toward world simulators that must remain consistent under camera motion, revisits, and intervention. Yet spatial memory remains a key bottleneck: explicit 3D structures can improve reprojection-based consistency but struggle to depict moving objects, while implicit memory often produces inaccurate camera motion even when conditioned on correct poses. We propose Mosaic Memory (MosaicMem), a hybrid spatial memory that lifts patches into 3D for reliable localization and targeted retrieval, while exploiting the model's native conditioning to preserve prompt-following generation. MosaicMem composes spatially aligned patches in the queried view via a patch-and-compose interface, preserving what should persist while allowing the model to inpaint what should evolve. Equipped with PRoPE camera conditioning and two new memory alignment methods, MosaicMem shows improved pose adherence compared to implicit memory and stronger dynamic modeling than explicit baselines. It further enables minute-level navigation, memory-based scene editing, and autoregressive rollout.
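The geometric core of the patch-and-compose idea — lifting patch centers into 3D with depth and camera parameters, then reprojecting them into a queried view so persistent content lands in the right pixels — can be sketched as below. This is an illustrative pinhole-camera round trip, not the paper's implementation; all function names, shapes, and the use of per-patch depth are assumptions.

```python
# Hedged sketch of patch lifting and reprojection for a spatial memory.
# Assumptions (not from the paper): patches are represented by their pixel
# centers with known per-patch depth, intrinsics K, and 4x4 camera poses.
import numpy as np

def lift_patches(uv, depth, K, cam_to_world):
    """Back-project pixel patch centers (N,2) with depths (N,) into world points (N,3)."""
    ones = np.ones((uv.shape[0], 1))
    rays = (np.linalg.inv(K) @ np.hstack([uv, ones]).T).T  # camera-frame rays at z=1
    pts_cam = rays * depth[:, None]                        # scale rays by depth
    pts_h = np.hstack([pts_cam, ones])                     # homogeneous coordinates
    return (cam_to_world @ pts_h.T).T[:, :3]               # transform into world frame

def reproject(points_w, K, world_to_cam):
    """Project world points (N,3) into a query camera; returns pixels (N,2), depths (N,)."""
    ones = np.ones((points_w.shape[0], 1))
    pts_cam = (world_to_cam @ np.hstack([points_w, ones]).T).T[:, :3]
    z = pts_cam[:, 2]                                      # depth in the query view
    uv_h = (K @ pts_cam.T).T                               # perspective projection
    return uv_h[:, :2] / z[:, None], z
```

Reprojected pixel locations tell the memory where retrieved patches should be composed in the queried frame; regions with no reprojected patch (or stale depth ordering) are left for the diffusion model to inpaint.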