The paper introduces 4RC, a feed-forward framework for 4D reconstruction from monocular videos that jointly models dense scene geometry and motion dynamics. 4RC uses a transformer backbone to encode the entire video into a spatio-temporal latent space, enabling a conditional decoder to query 3D geometry and motion for any frame at any timestamp. The method represents per-view 4D attributes by decomposing them into base geometry and time-dependent relative motion, and achieves state-of-the-art performance across a range of 4D reconstruction tasks.
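As a rough illustration of this factorization (a minimal sketch; the function name and tensor shapes below are assumptions for exposition, not 4RC's actual interface), the point map for a frame at a target timestamp can be composed from a time-independent base and a time-dependent offset:

```python
import torch

def compose_pointmap(base_geometry: torch.Tensor,
                     relative_motion: torch.Tensor) -> torch.Tensor:
    """Hypothetical composition of a per-view 4D point map.

    base_geometry:   (H, W, 3) time-independent 3D points for a query frame.
    relative_motion: (H, W, 3) per-point displacement predicted for one
                     target timestamp (zero when the timestamp matches the
                     frame's own capture time).
    Returns the (H, W, 3) point map at the target timestamp.
    """
    return base_geometry + relative_motion
```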
Unlock efficient 4D scene understanding from monocular video with a novel "encode-once, query-anywhere and anytime" framework that jointly models geometry and motion.
We present 4RC, a unified feed-forward framework for 4D reconstruction from monocular videos. Unlike existing approaches that typically decouple motion from geometry or produce limited 4D attributes such as sparse trajectories or two-view scene flow, 4RC learns a holistic 4D representation that jointly captures dense scene geometry and motion dynamics. At its core, 4RC introduces a novel encode-once, query-anywhere and anytime paradigm: a transformer backbone encodes the entire video into a compact spatio-temporal latent space, from which a conditional decoder can efficiently query 3D geometry and motion for any frame at any target timestamp. To facilitate learning, we represent per-view 4D attributes in a minimally factorized form by decomposing them into base geometry and time-dependent relative motion. Extensive experiments demonstrate that 4RC outperforms prior and concurrent methods across a wide range of 4D reconstruction tasks.
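A minimal PyTorch sketch of the encode-once, query-anywhere-and-anytime pattern described above; all class and method names, layer sizes, and the additive timestamp conditioning are illustrative assumptions, not 4RC's released code:

```python
import torch
import torch.nn as nn

class FeedForward4D(nn.Module):
    """Illustrative encode-once, query-anywhere-and-anytime skeleton."""

    def __init__(self, dim: int = 768, heads: int = 12, enc_layers: int = 12):
        super().__init__()
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                         batch_first=True)
        self.backbone = nn.TransformerEncoder(enc, num_layers=enc_layers)
        dec = nn.TransformerDecoderLayer(d_model=dim, nhead=heads,
                                         batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, num_layers=2)
        self.time_embed = nn.Linear(1, dim)   # conditions queries on the target timestamp
        self.geom_head = nn.Linear(dim, 3)    # base geometry (xyz per token)
        self.motion_head = nn.Linear(dim, 3)  # time-dependent relative motion

    def encode(self, video_tokens: torch.Tensor) -> torch.Tensor:
        # Run once per video: (B, N, dim) patch tokens -> spatio-temporal latents.
        return self.backbone(video_tokens)

    def query(self, latents: torch.Tensor, frame_tokens: torch.Tensor,
              timestamp: torch.Tensor):
        # Decode geometry and motion for one query frame at one target timestamp.
        t = self.time_embed(timestamp.view(-1, 1, 1))    # (B, 1, dim)
        feats = self.decoder(frame_tokens + t, latents)  # cross-attend to cached latents
        base = self.geom_head(feats)                     # time-independent geometry
        moved = base + self.motion_head(feats)           # geometry at the target timestamp
        return base, moved

model = FeedForward4D()
video = torch.randn(1, 256, 768)       # e.g. 16 frames x 16 patch tokens each
latents = model.encode(video)          # encode once per video
frame = video[:, :16]                  # tokens of one query frame
for ts in (0.0, 0.5, 1.0):             # ...then query any target timestamp
    base, points_t = model.query(latents, frame, torch.tensor([ts]))
```

The key property this sketch tries to capture is that `encode` runs a single time per video, after which arbitrary (frame, timestamp) queries reuse the cached latents through a lightweight conditional decoder.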