Stanford HAI · core contributors · project leads and equal contributions · Apr 21, 2026 · arXiv:2604.19741

CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

Gene Chou, Charles Herrmann, Kyle Genova, Boyang Deng, Songyou Peng, Bharath Hariharan, Jason Y. Zhang, Noah Snavely, Philipp Henzler

AI Summary

CityRAG is a video generative model that grounds generation in real-world locations by using geo-registered data as context, enabling simulation of real environments under varying weather conditions and dynamic object configurations. The model is trained on temporally unaligned data, which teaches it to disentangle the static scene from transient attributes such as weather and lighting. Experiments show that CityRAG generates coherent, minutes-long, physically grounded video sequences, maintains weather and lighting conditions over thousands of frames, and achieves loop closure across complex trajectories.

Key Contribution

Generate navigable, 3D-consistent simulations of real-world locations with arbitrary weather and dynamic object configurations using only geo-registered video data.

Abstract

We address the problem of generating a 3D-consistent, navigable environment that is spatially grounded: a simulation of a real location. Existing video generative models can produce a plausible sequence that is consistent with a text (T2V) or image (I2V) prompt. However, the capability to reconstruct the real world under arbitrary weather conditions and dynamic object configurations is essential for downstream applications including autonomous driving and robotics simulation. To this end, we present CityRAG, a video generative model that leverages large corpora of geo-registered data as context to ground generation to the physical scene, while maintaining learned priors for complex motion and appearance changes. CityRAG relies on temporally unaligned training data, which teaches the model to semantically disentangle the underlying scene from its transient attributes. Our experiments demonstrate that CityRAG can generate coherent minutes-long, physically grounded video sequences, maintain weather and lighting conditions over thousands of frames, achieve loop closure, and navigate complex trajectories to reconstruct real-world geography.

Computer Vision, Multimodal Models, Robotics & Embodied AI, World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 73

Year: 2026

Venue: N/A
