Tsinghua AI | Fudan | USTC | Feb 26, 2026 | arXiv:2602.22960

UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models

Tianxing Xu, Zixuan Wang, Guangyuan Wang, Li Hu, Zhongyi Zhang, Bang Zhang, Song-Hai Zhang

AI Summary

The paper introduces UCM, a world-model framework that unifies long-term memory and precise camera control through a time-aware positional encoding warping mechanism. It targets two weaknesses of existing methods: approaches built on explicit 3D reconstruction lose flexibility in unbounded scenarios and on fine-grained structures, while approaches that reuse previously generated frames without explicit spatial correspondence limit controllability and consistency. UCM uses an efficient dual-stream diffusion transformer to reduce computational overhead and is trained on over 500K monocular videos, curated with a point-cloud-based rendering strategy that simulates scene revisiting. The paper reports that UCM significantly outperforms state-of-the-art methods in long-term scene consistency while achieving precise camera controllability.

Key Contribution

UCM achieves both long-term scene consistency and precise camera control in world models without relying on explicit 3D reconstruction.

Abstract

World models based on video generation demonstrate remarkable potential for simulating interactive environments but face persistent difficulties in two key areas: maintaining long-term content consistency when scenes are revisited and enabling precise camera control from user-provided inputs. Existing methods based on explicit 3D reconstruction often compromise flexibility in unbounded scenarios and fine-grained structures. Alternative methods rely directly on previously generated frames without establishing explicit spatial correspondence, thereby constraining controllability and consistency. To address these limitations, we present UCM, a novel framework that unifies long-term memory and precise camera control via a time-aware positional encoding warping mechanism. To reduce computational overhead, we design an efficient dual-stream diffusion transformer for high-fidelity generation. Moreover, we introduce a scalable data curation strategy utilizing point-cloud-based rendering to simulate scene revisiting, facilitating training on over 500K monocular videos. Extensive experiments on real-world and synthetic benchmarks demonstrate that UCM significantly outperforms state-of-the-art methods in long-term scene consistency, while also achieving precise camera controllability in high-fidelity video generation.
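The abstract does not describe how the positional encoding warping or the point-cloud-based rendering is implemented. As an illustrative sketch only, the kind of explicit spatial correspondence it refers to can be obtained with standard depth-based reprojection: lift each source pixel to a 3D point, move it into a second camera, and project it back to pixel coordinates. Everything named here (the function `warp_pixel_grid`, the shared pinhole intrinsics `K`, the known per-pixel depth, the transform `T_src_to_tgt`) is an assumption made for the example and is not taken from the paper.

```python
import numpy as np


def warp_pixel_grid(depth, K, T_src_to_tgt):
    """Reproject every source-view pixel into a target camera.

    Hypothetical helper, not the paper's method.
    depth:        (H, W) source depth along the camera z-axis
    K:            (3, 3) pinhole intrinsics, assumed shared by both views
    T_src_to_tgt: (4, 4) rigid transform from source to target camera frame
    Returns (H, W, 2) target pixel coordinates (u, v) and an (H, W) mask
    marking points that land in front of the target camera.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W, dtype=np.float64),
                       np.arange(H, dtype=np.float64))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)  # homogeneous pixels

    # Unproject: pixel -> ray at unit depth -> 3D point in the source frame.
    rays = pix @ np.linalg.inv(K).T
    pts_src = rays * depth[..., None]

    # Move the point cloud into the target camera frame.
    R, t = T_src_to_tgt[:3, :3], T_src_to_tgt[:3, 3]
    pts_tgt = pts_src @ R.T + t

    # Project back to pixels, guarding against points behind the camera.
    z = pts_tgt[..., 2]
    valid = z > 1e-6
    safe_z = np.where(valid, z, 1.0)
    uv = (pts_tgt @ K.T)[..., :2] / safe_z[..., None]
    return uv, valid
```

Under these assumptions, the returned coordinates say where each remembered pixel would appear in the new view. Such a warped grid could in principle index positional encodings, or drive a point-cloud render that imitates revisiting a scene; how UCM actually uses correspondence, and how its time-aware component enters, is not stated in the abstract.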

Computer Vision | Robotics & Embodied AI | World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 59

Year: 2026

Venue: N/A
