MoRe is a feedforward 4D reconstruction network that recovers dynamic 3D scenes from monocular videos. It disentangles dynamic motion from static structure with an attention-forcing strategy. The model is fine-tuned on large-scale datasets covering both dynamic and static scenes, and it uses grouped causal attention to capture temporal dependencies and adapt to varying token lengths across frames. Experiments on multiple benchmarks show that MoRe produces high-quality dynamic reconstructions more efficiently than optimization-based methods.
Ditch the optimization: MoRe achieves real-time 4D scene reconstruction from monocular video using a feedforward transformer that disentangles motion and structure.
Reconstructing dynamic 4D scenes remains challenging because moving objects corrupt camera pose estimation. Existing optimization-based methods alleviate this issue with additional supervision, but most are computationally expensive and impractical for real-time applications. To address these limitations, we propose MoRe, a feedforward 4D reconstruction network that efficiently recovers dynamic 3D scenes from monocular videos. Built upon a strong static reconstruction backbone, MoRe employs an attention-forcing strategy to disentangle dynamic motion from static structure. To further enhance robustness, we fine-tune the model on large-scale, diverse datasets encompassing both dynamic and static scenes. Moreover, our grouped causal attention captures temporal dependencies and adapts to varying token lengths across frames, ensuring temporally coherent geometry reconstruction. Extensive experiments on multiple benchmarks demonstrate that MoRe achieves high-quality dynamic reconstructions with exceptional efficiency.
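The abstract does not spell out how the grouped causal attention is built, so the following is only an illustrative sketch of one common reading: tokens are grouped by frame, and each token may attend to every token in its own frame and in all earlier frames, but to none in later frames. The function name `grouped_causal_mask` and the parameter `tokens_per_frame` are hypothetical and do not come from MoRe; the per-frame token counts are allowed to differ to mirror the "varying token lengths across frames" mentioned above.

```python
from typing import List


def grouped_causal_mask(tokens_per_frame: List[int]) -> List[List[bool]]:
    """Return mask[q][k] == True where query token q may attend to key token k.

    Assumption (not taken from the paper): each frame's tokens form one group.
    A token sees all tokens in its own frame and in earlier frames only.
    """
    # Map every token index to the index of the frame (group) it belongs to.
    frame_of = [f for f, n in enumerate(tokens_per_frame) for _ in range(n)]
    # A key is visible when its frame is not later than the query's frame.
    return [[fk <= fq for fk in frame_of] for fq in frame_of]


if __name__ == "__main__":
    # Three frames with 2, 3, and 1 tokens respectively (hypothetical sizes).
    mask = grouped_causal_mask([2, 3, 1])
    for row in mask:
        print("".join("x" if allowed else "." for allowed in row))
```

With these sizes the mask is block lower-triangular: the two tokens of the first frame see only each other, the three tokens of the second frame see the first five tokens, and the single token of the last frame sees all six. In a transformer, a boolean mask of this shape would typically be applied to the attention scores before the softmax; how MoRe actually does this is not stated in the abstract.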