AI2, ByteDance, UNC | Feb 16, 2026 | arXiv:2602.14941

AnchorWeave: World-Consistent Video Generation with Retrieved Local Spatial Memories

Zun Wang, Han Lin, Jaehong Yoon, Jaemin Cho, Mohit Bansal

AI Summary

The paper introduces AnchorWeave, a memory-augmented video generation framework designed to improve spatial world consistency in camera-controllable video generation by addressing misalignment issues in global 3D scene reconstruction. AnchorWeave replaces a single misaligned global memory with multiple clean local geometric memories and learns to reconcile cross-view inconsistencies through coverage-driven local memory retrieval and a multi-anchor weaving controller. Experiments demonstrate that AnchorWeave significantly improves long-term scene consistency and visual quality, validating the effectiveness of local geometric conditioning, multi-anchor control, and coverage-driven retrieval.

Key Contribution

Ditch the messy global 3D scene reconstruction: AnchorWeave weaves together clean, local geometric memories for camera-controllable video generation, boosting long-term consistency and visual quality.

Abstract

Maintaining spatial world consistency over long horizons remains a central challenge for camera-controllable video generation. Existing memory-based approaches often condition generation on globally reconstructed 3D scenes by rendering anchor videos from the reconstructed geometry in the history. However, reconstructing a global 3D scene from multiple views inevitably introduces cross-view misalignment, as pose and depth estimation errors cause the same surfaces to be reconstructed at slightly different 3D locations across views. When fused, these inconsistencies accumulate into noisy geometry that contaminates the conditioning signals and degrades generation quality. We introduce AnchorWeave, a memory-augmented video generation framework that replaces a single misaligned global memory with multiple clean local geometric memories and learns to reconcile their cross-view inconsistencies. To this end, AnchorWeave performs coverage-driven local memory retrieval aligned with the target trajectory and integrates the selected local memories through a multi-anchor weaving controller during generation. Extensive experiments demonstrate that AnchorWeave significantly improves long-term scene consistency while maintaining strong visual quality, with ablation and analysis studies further validating the effectiveness of local geometric conditioning, multi-anchor control, and coverage-driven retrieval.
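To make the idea of coverage-driven retrieval more concrete, here is a minimal sketch of one plausible selection rule: greedily pick the local memories that cover the most not-yet-covered parts of the target view. This is an illustrative assumption, not the procedure from the paper, which the abstract does not specify. The function name `greedy_coverage_retrieval`, the representation of visibility as sets of integer cell IDs, and the `max_anchors` budget are all hypothetical.

```python
from typing import Dict, List, Optional, Set


def greedy_coverage_retrieval(
    target_cells: Set[int],
    memory_coverage: Dict[str, Set[int]],
    max_anchors: int,
) -> List[str]:
    """Greedily select up to `max_anchors` local memories.

    Hypothetical setup: `target_cells` are the scene cells visible along
    the target camera trajectory, and `memory_coverage[m]` are the cells
    that local memory `m` can render. Each step picks the memory that
    covers the most still-uncovered target cells.
    """
    uncovered = set(target_cells)
    selected: List[str] = []

    while uncovered and len(selected) < max_anchors:
        best_id: Optional[str] = None
        best_gain = 0
        # Sorted iteration makes tie-breaking deterministic.
        for mem_id in sorted(memory_coverage):
            if mem_id in selected:
                continue
            gain = len(memory_coverage[mem_id] & uncovered)
            if gain > best_gain:
                best_id, best_gain = mem_id, gain
        if best_id is None:
            # No remaining memory adds coverage; stop early.
            break
        selected.append(best_id)
        uncovered -= memory_coverage[best_id]

    return selected


if __name__ == "__main__":
    target = set(range(8))
    memories = {
        "a": {0, 1, 2, 3},
        "b": {2, 3, 4},
        "c": {4, 5, 6, 7},
        "d": {9},  # does not overlap the target view
    }
    print(greedy_coverage_retrieval(target, memories, max_anchors=3))
```

In this toy setting, memories that add no new coverage of the target view are never selected, so the anchor set stays small. How the selected anchors are then fused is the role of the multi-anchor weaving controller described above, which this sketch does not attempt to model.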

Computer Vision, Multimodal Models, World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
