DAGE, a dual-stream transformer architecture, is introduced to address the challenges of high-resolution, long-sequence geometry and camera pose estimation from multi-view inputs. The architecture uses a low-resolution stream with frame/global attention for view consistency and camera estimation, and a high-resolution stream to preserve fine details. By fusing these streams with cross-attention, DAGE achieves state-of-the-art results in video geometry estimation and multi-view reconstruction while scaling effectively to 2K inputs.
Achieve state-of-the-art results in high-resolution video geometry estimation by disentangling global coherence and fine detail using a dual-stream transformer architecture.
Estimating accurate, view-consistent geometry and camera poses from uncalibrated multi-view/video inputs remains challenging, especially at high spatial resolutions and over long sequences. We present DAGE, a dual-stream transformer whose main novelty is to disentangle global coherence from fine detail. A low-resolution stream operates on aggressively downsampled frames with alternating frame/global attention to build a view-consistent representation and estimate cameras efficiently, while a high-resolution stream processes the original images per-frame to preserve sharp boundaries and small structures. A lightweight adapter fuses these streams via cross-attention, injecting global context without disturbing the pretrained single-frame pathway. This design scales resolution and clip length independently, supports inputs up to 2K, and maintains practical inference cost. DAGE delivers sharp depth/pointmaps, strong cross-view consistency, and accurate poses, establishing new state-of-the-art results for video geometry estimation and multi-view reconstruction.
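
The fusion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-head scaled dot-product attention with no learned projections, and it assumes a scalar residual gate (`gate`, a hypothetical parameter) as one common way to inject context without disturbing a pretrained pathway. The names `cross_attention` and `fuse` are illustrative only. High-resolution tokens act as queries; low-resolution tokens supply keys and values.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of floats.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    # Single-head scaled dot-product attention on lists of vectors.
    # Assumption: no learned Q/K/V projections, to keep the sketch short.
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def fuse(high_res_tokens, low_res_tokens, gate):
    # High-res tokens query the low-res (globally consistent) tokens.
    # The result is added residually, scaled by a hypothetical scalar gate.
    context = cross_attention(high_res_tokens, low_res_tokens, low_res_tokens)
    return [[h + gate * c for h, c in zip(h_tok, c_tok)]
            for h_tok, c_tok in zip(high_res_tokens, context)]

# Two high-res tokens and three low-res tokens, feature dimension 2.
hi = [[1.0, 2.0], [0.0, -1.0]]
lo = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
fused = fuse(hi, lo, gate=0.1)
```

With `gate=0.0` the output equals the high-resolution tokens exactly, which is one way the single-frame pathway could start out undisturbed under this assumption. Note also that the number of low-resolution tokens only affects the key/value side, so image resolution and clip length can vary independently of each other in this sketch.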