This paper introduces Hybrid Memory, a new approach for video world models that addresses the limitations of existing methods in handling dynamic subjects that move out of and back into view. The authors construct HM-World, a large-scale video dataset designed to evaluate hybrid memory capabilities, and propose HyDRA, a memory architecture that uses spatiotemporal relevance-driven retrieval to preserve the identity and motion of hidden subjects. Experiments on HM-World demonstrate that HyDRA significantly outperforms state-of-the-art approaches in dynamic subject consistency and overall generation quality.
World models can now remember and realistically regenerate dynamic objects that temporarily disappear from view, thanks to a novel hybrid memory architecture.
Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases. When dynamic subjects move out of view and later re-emerge, current methods often struggle, leading to frozen, distorted, or vanishing subjects. To address this, we introduce Hybrid Memory, a novel paradigm requiring models to simultaneously act as precise archivists for static backgrounds and vigilant trackers for dynamic subjects, ensuring motion continuity during out-of-view intervals. To facilitate research in this direction, we construct HM-World, the first large-scale video dataset dedicated to hybrid memory. It features 59K high-fidelity clips with decoupled camera and subject trajectories, encompassing 17 diverse scenes, 49 distinct subjects, and meticulously designed exit-entry events to rigorously evaluate hybrid coherence. Furthermore, we propose HyDRA, a specialized memory architecture that compresses memory into tokens and utilizes a spatiotemporal relevance-driven retrieval mechanism. By selectively attending to relevant motion cues, HyDRA effectively preserves the identity and motion of hidden subjects. Extensive experiments on HM-World demonstrate that our method significantly outperforms state-of-the-art approaches in both dynamic subject consistency and overall generation quality. Code is publicly available at https://github.com/H-EmbodVis/HyDRA.
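The retrieval mechanism described in the abstract, selectively attending to the most relevant compressed memory tokens, can be sketched as top-k attention over a token bank. This is a minimal illustration only: the function name `relevance_retrieval`, the dot-product scoring rule, and all dimensions are assumptions, not HyDRA's actual implementation.

```python
import numpy as np

def relevance_retrieval(query, memory_tokens, k=4):
    """Hypothetical sketch: retrieve from compressed memory by relevance.

    query:         (d,) feature for the current generation step
    memory_tokens: (n, d) bank of compressed spatiotemporal memory tokens
    Returns a context vector that attends only to the k most relevant tokens.
    """
    d = query.shape[0]
    # Score each memory token by scaled dot-product similarity to the query.
    scores = memory_tokens @ query / np.sqrt(d)
    # Keep only the top-k tokens -- the "selective attention" step.
    top = np.argsort(scores)[-k:]
    # Softmax-weighted sum over the selected tokens.
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()
    return w @ memory_tokens[top]

rng = np.random.default_rng(0)
mem = rng.normal(size=(32, 16))   # 32 memory tokens of dimension 16
q = rng.normal(size=16)           # query for the current frame
ctx = relevance_retrieval(q, mem, k=4)
print(ctx.shape)  # (16,)
```

The top-k gating is what distinguishes this from dense attention: tokens irrelevant to a hidden subject's trajectory contribute nothing to the retrieved context.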