The paper introduces DecMem, a decoupled memory architecture for video generation that addresses the limitations of computational inefficiency and attention dispersion in long-horizon extrapolation. DecMem uses Sparse Global Memory for efficient access to global history and Anchored Local Memory for stable, high-quality extrapolation. Experiments show DecMem significantly outperforms state-of-the-art methods, enabling minute-level controllable video generation with improved fidelity and consistency.
Generate minute-long, consistent videos with a memory architecture that outperforms existing methods by decoupling global and local memory access.
Recent advances in video generative models have promoted rapid progress in controllable world models. However, maintaining fine-grained spatio-temporal consistency under long-horizon reasoning remains a key challenge. In this work, we move beyond explicit 3D memory and coarse frame-level implicit modeling, and propose a fine-grained, learnable, and scalable memory for consistent world generation. We first identify two fundamental limitations of naïve learnable memory architectures in long-horizon extrapolation, namely computational inefficiency and attention dispersion. Through a systematic analysis of attention dispersion, we propose DecMem, a decoupled memory architecture that employs Sparse Global Memory for efficient fine-grained access to global history and Anchored Local Memory for stable and high-quality extrapolation. Extensive experiments demonstrate that DecMem significantly outperforms current state-of-the-art methods. By ensuring precise and efficient long-term memory and achieving superior extrapolation capabilities, DecMem enables minute-level controllable long video generation with high fidelity and consistency.
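To make the term attention dispersion concrete, the toy sketch below shows how softmax attention mass on a single relevant key shrinks as the number of competing keys grows, and how restricting attention to the highest-scoring keys counteracts this. It is not DecMem's implementation: the abstract does not specify how Sparse Global Memory selects entries, so generic top-k selection stands in here, and the function names, the score of 4.0 for the relevant key, and the standard-normal distractor scores are all illustrative assumptions.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relevant_mass(num_distractors, top_k=None, seed=0):
    """Attention weight received by one relevant key among random distractors.

    Assumed toy setup (not from the paper): the relevant key has score 4.0 and
    each distractor has a standard-normal score. If top_k is given, only the
    top_k highest-scoring keys are attended to (generic top-k sparsification).
    """
    rng = random.Random(seed)
    scores = [4.0] + [rng.gauss(0.0, 1.0) for _ in range(num_distractors)]
    if top_k is None:
        # Dense attention: every key competes in the softmax.
        return softmax(scores)[0]
    # Sparse attention: keep the top_k scores, give all other keys zero weight.
    kept = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    if 0 not in kept:
        return 0.0
    weights = softmax([scores[i] for i in kept])
    return weights[kept.index(0)]


dense_short = relevant_mass(num_distractors=64)
dense_long = relevant_mass(num_distractors=4096)
sparse_long = relevant_mass(num_distractors=4096, top_k=16)

print(f"dense, 64 distractors:    {dense_short:.4f}")
print(f"dense, 4096 distractors:  {dense_long:.4f}")
print(f"top-16, 4096 distractors: {sparse_long:.4f}")
```

In this toy setting the dense weight on the relevant key drops as the history grows, because every additional key adds to the softmax denominator. Selecting a small subset of keys removes most of that competition and also reduces the number of scores that must be normalized, which loosely mirrors the efficiency and dispersion concerns the abstract names.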