KAIST, Korea U | Jun 1, 2026 | arXiv:2606.02479

Retrieve What's Missing: Coverage-Maximizing Retrieval for Consistent Long Video Generation

Minseok Joo, Dogyun Park, Taehoon Lee, Kyujin Lee, Hyunwoo J. Kim

AI Summary

This paper introduces Coverage-Maximizing Retrieval-Augmented Generation (COVRAG), a depth-based memory retrieval framework for long-horizon autoregressive video generation. COVRAG uses pretrained 3D priors to build a target-view coverage map, then selects historical frames that cover target-view regions not yet explained by the current context or previously selected memories. This addresses two limitations of existing methods: camera poses and field-of-view overlap are too coarse to reason about pixel-wise visibility, while explicit 3D reconstruction is costly to maintain over long rollouts. Experiments on RealEstate10K and DL3DV10K show that COVRAG improves long-horizon geometric consistency while maintaining low latency compared to baselines.

Key Contribution

COVRAG improves long-horizon geometric consistency in video generation by retrieving memory frames that maximize residual coverage of the target view, while maintaining low latency compared to baselines.

Abstract

Maintaining long-term geometric consistency remains challenging for long-horizon autoregressive video generation. Memory-augmented generative models address this by retrieving historical frames, but their effectiveness depends on two key design choices: what 3D-geometric evidence should represent past observations, and how memory frames should be selected from this evidence. Existing methods often rely on camera poses or field-of-view overlap, which are lightweight but too coarse to reason about pixel-wise visibility, or use explicit 3D reconstruction, which provides fine-grained evidence but is costly to maintain over long rollouts. We propose Coverage-Maximizing Retrieval-Augmented Generation (COVRAG), a depth-based memory retrieval framework that uses pretrained 3D priors to construct a target-view coverage map as lightweight 3D memory evidence. For frame selection, COVRAG maximizes residual coverage gain, iteratively retrieving frames that explain target-view regions not covered by the current context or previously selected memories. To improve scalability in long-video generation, we introduce sliding-window depth caching for efficient geometry estimation. Experiments on RealEstate10K and DL3DV10K show that COVRAG improves long-horizon geometric consistency while maintaining low latency compared to baselines.
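The frame-selection step described in the abstract (iteratively retrieving the frame that explains the most not-yet-covered target-view pixels) can be illustrated with a small greedy loop. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes each candidate memory frame has already been reduced to a binary coverage mask over the target view (in the paper this evidence comes from depth-based coverage maps, which are not reproduced here), and the function name `greedy_coverage_retrieval`, its parameters, and the toy masks are all hypothetical.

```python
import numpy as np


def greedy_coverage_retrieval(candidate_masks, context_mask, k):
    """Greedily pick up to k frames by residual coverage gain.

    candidate_masks: (N, H, W) boolean array; candidate_masks[i] marks the
        target-view pixels that memory frame i is assumed to cover.
    context_mask: (H, W) boolean array of pixels already covered by the
        current context.
    Returns (selected frame indices, final covered mask).
    """
    covered = context_mask.copy()
    remaining = list(range(len(candidate_masks)))
    selected = []
    for _ in range(k):
        best_idx, best_gain = None, 0
        for i in remaining:
            # Residual gain: pixels this frame covers that nothing covers yet.
            gain = int(np.count_nonzero(candidate_masks[i] & ~covered))
            if gain > best_gain:
                best_idx, best_gain = i, gain
        if best_idx is None:
            break  # no remaining frame adds new coverage
        selected.append(best_idx)
        covered |= candidate_masks[best_idx]
        remaining.remove(best_idx)
    return selected, covered


# Toy example: a 1x8 "target view" and four hypothetical memory frames.
context = np.array([[1, 1, 0, 0, 0, 0, 0, 0]], dtype=bool)
masks = np.array(
    [
        [[1, 1, 1, 0, 0, 0, 0, 0]],  # mostly overlaps the context
        [[0, 0, 1, 1, 1, 1, 0, 0]],  # covers four new pixels
        [[0, 0, 0, 0, 1, 1, 1, 0]],  # covers three new pixels initially
        [[1, 1, 0, 0, 0, 0, 0, 0]],  # fully redundant with the context
    ],
    dtype=bool,
)
selected, covered = greedy_coverage_retrieval(masks, context, k=3)
print(selected)  # [1, 2]
```

In this toy run, frame 1 is chosen first because it adds four uncovered pixels. Frame 2 then adds only one new pixel, since two of its three pixels are already explained by frame 1, and the loop stops early because no remaining frame adds coverage. Scoring frames by residual gain rather than raw overlap is what keeps the retrieved set from being redundant.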

Computer Vision | Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
