The paper introduces Flow3r, a framework for visual geometry learning that uses dense 2D correspondences (flow) as supervision, enabling training on large amounts of unlabeled monocular video. The core idea is to factor the flow prediction module: flow between two images is predicted from the geometry latents of one image and the pose latents of the other, so the flow loss guides the learning of both scene geometry and camera motion. Experiments show that factored flow prediction outperforms alternative designs and that Flow3r achieves state-of-the-art results across eight benchmarks, with its largest gains on dynamic, in-the-wild videos.
Unlabeled monocular videos can now be used to train state-of-the-art 3D/4D reconstruction systems, thanks to a factored flow prediction approach that disentangles geometry and pose learning.
Current feed-forward 3D/4D reconstruction systems rely on dense geometry and pose supervision, which is expensive to obtain at scale and particularly scarce for dynamic real-world scenes. We present Flow3r, a framework that augments visual geometry learning with dense 2D correspondences ("flow") as supervision, enabling scalable training from unlabeled monocular videos. Our key insight is that the flow prediction module should be factored: predicting flow between two images using geometry latents from one and pose latents from the other. This factorization directly guides the learning of both scene geometry and camera motion, and naturally extends to dynamic scenes. In controlled experiments, we show that factored flow prediction outperforms alternative designs and that performance scales consistently with unlabeled data. Integrating factored flow into existing visual geometry architectures and training with ~800K unlabeled videos, Flow3r achieves state-of-the-art results across eight benchmarks spanning static and dynamic scenes, with its largest gains on in-the-wild dynamic videos where labeled data is most scarce.
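To make the intuition behind the factorization concrete, the sketch below computes the flow that a static scene induces between two views from explicit quantities: the depth map of image A (geometry of one view) and the relative camera pose to image B (motion to the other view). This is the standard rigid-flow relation from multi-view geometry, not the paper's module, which operates on learned latents and also handles dynamic scenes. The function name `rigid_flow`, the pinhole intrinsics `K`, and the convention that `(R, t)` maps A-camera coordinates into B-camera coordinates are all assumptions made for illustration.

```python
import numpy as np

def rigid_flow(depth, K, R, t):
    """Flow from image A to image B for a static scene (illustrative only).

    depth : (H, W) depth map of image A        -> geometry of one view
    K     : (3, 3) pinhole intrinsics, assumed shared by both views
    R, t  : rotation (3, 3) and translation (3,) taking points from
            A's camera frame to B's camera frame -> relative pose
    Returns an (H, W, 2) array of per-pixel displacements (du, dv).
    """
    h, w = depth.shape
    # Pixel grid of image A in homogeneous coordinates (u, v, 1).
    u, v = np.meshgrid(np.arange(w, dtype=np.float64),
                       np.arange(h, dtype=np.float64))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)      # (H, W, 3)

    # Back-project each pixel to a 3D point in A's camera frame.
    pts_a = (pix @ np.linalg.inv(K).T) * depth[..., None]

    # Move the points into B's camera frame and project them.
    pts_b = pts_a @ R.T + t
    proj = pts_b @ K.T
    uv_b = proj[..., :2] / proj[..., 2:3]

    # Flow is the displacement of each pixel between the two views.
    return uv_b - pix[..., :2]


# Toy example: a fronto-parallel plane at depth 2 seen by a camera
# whose frame shifts the points by 0.1 along x.
K = np.array([[100.0, 0.0, 8.0],
              [0.0, 100.0, 6.0],
              [0.0, 0.0, 1.0]])
depth = np.full((12, 16), 2.0)
flow = rigid_flow(depth, K, np.eye(3), np.array([0.1, 0.0, 0.0]))
```

In this toy case every pixel moves horizontally by `fx * tx / depth = 100 * 0.1 / 2 = 5` pixels. Because the flow depends jointly on the depth of one view and the pose relative to the other, a loss on predicted flow can send gradients to both, which is the coupling that flow supervision exploits.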