Google Research | Mar 3, 2026 | arXiv:2603.03269

LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory

Junyi Zhang, Charles Herrmann, Junhwa Hur, Chen Sun, Cheng Sun, Ming-Hsuan Yang, Forrester Cole, Trevor Darrell, Deqing Sun

AI Summary

LoGeR is an architecture that scales dense 3D reconstruction to extremely long video sequences without post-optimization. It processes video in chunks, using bidirectional priors for intra-chunk reasoning, and adds a learning-based hybrid memory module to maintain coherence across chunk boundaries. The hybrid memory combines a parametric Test-Time Training (TTT) memory, which anchors the global coordinate frame, with a non-parametric Sliding Window Attention (SWA) mechanism, which preserves uncompressed context. This lets a model trained on 128-frame sequences generalize to thousands of frames at inference.

Key Contribution

Achieves globally consistent 3D reconstruction on sequences of up to 19k frames by combining test-time training with sliding window attention, reducing ATE on KITTI by over 74% relative to prior state-of-the-art feedforward methods.
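To make the headline metric concrete, here is a minimal sketch of how a trajectory error and a relative reduction could be computed. It assumes ATE means the root-mean-square translational error between estimated and reference camera positions that are already expressed in a common frame; the alignment step and the exact evaluation protocol are not described on this page. The function names `ate_rmse` and `relative_reduction` and all numbers in the usage lines are illustrative, not values from the paper.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def ate_rmse(estimated: Sequence[Point], reference: Sequence[Point]) -> float:
    """RMSE of per-frame position errors (assumes trajectories are pre-aligned)."""
    if not estimated or len(estimated) != len(reference):
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_errors = [
        sum((e - r) ** 2 for e, r in zip(est_pos, ref_pos))
        for est_pos, ref_pos in zip(estimated, reference)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Fraction by which new_error is lower than baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical toy trajectories: the second pose is off by 2 units along z.
est = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
toy_ate = ate_rmse(est, ref)

# Hypothetical errors: a drop from 10.0 to 2.5 is a 75% reduction.
toy_reduction = relative_reduction(10.0, 2.5)
```

Under this reading, a reduction of "over 74%" means the new error is less than 26% of the baseline error.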

Abstract

Feedforward geometric foundation models achieve strong short-window reconstruction, yet scaling them to minutes-long videos is bottlenecked by quadratic attention complexity or limited effective memory in recurrent designs. We present LoGeR (Long-context Geometric Reconstruction), a novel architecture that scales dense 3D reconstruction to extremely long sequences without post-optimization. LoGeR processes video streams in chunks, leveraging strong bidirectional priors for high-fidelity intra-chunk reasoning. To manage the critical challenge of coherence across chunk boundaries, we propose a learning-based hybrid memory module. This dual-component system combines a parametric Test-Time Training (TTT) memory to anchor the global coordinate frame and prevent scale drift, alongside a non-parametric Sliding Window Attention (SWA) mechanism to preserve uncompressed context for high-precision adjacent alignment. Remarkably, this memory architecture enables LoGeR to be trained on sequences of 128 frames, and generalize up to thousands of frames during inference. Evaluated across standard benchmarks and a newly repurposed VBR dataset with sequences of up to 19k frames, LoGeR substantially outperforms prior state-of-the-art feedforward methods--reducing ATE on KITTI by over 74%--and achieves robust, globally consistent reconstruction over unprecedented horizons.
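The chunked control flow described in the abstract can be sketched as a toy, under heavy assumptions. The per-frame "features" are plain floats; the parametric memory is a single scalar updated by one gradient step per chunk, a stand-in for TTT fast weights; and the sliding window is a buffer holding the most recent chunks uncompressed, with no attention actually computed. `HybridMemory`, `iter_chunks`, `process_stream`, and every parameter are hypothetical names for illustration, not LoGeR's implementation.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Iterator, List, Sequence, Tuple


def iter_chunks(num_frames: int, chunk_size: int) -> Iterator[range]:
    """Yield consecutive, non-overlapping ranges of frame indices."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, num_frames, chunk_size):
        yield range(start, min(start + chunk_size, num_frames))


@dataclass
class HybridMemory:
    window_chunks: int = 1   # how many past chunks are kept uncompressed
    lr: float = 0.5          # step size of the test-time update
    state: float = 0.0       # toy parametric memory (stand-in for fast weights)
    window: Deque[List[float]] = field(init=False)

    def __post_init__(self) -> None:
        # deque(maxlen=k) silently drops the oldest chunk once k are stored.
        self.window = deque(maxlen=self.window_chunks)

    def read(self) -> Tuple[float, List[float]]:
        """Return the compressed state and the uncompressed recent frames."""
        recent = [x for chunk in self.window for x in chunk]
        return self.state, recent

    def write(self, chunk: Sequence[float]) -> None:
        """One gradient step on 0.5 * (state - mean(chunk))**2, then buffer the chunk."""
        target = sum(chunk) / len(chunk)
        self.state -= self.lr * (self.state - target)
        self.window.append(list(chunk))


def process_stream(features: Sequence[float], chunk_size: int) -> List[Tuple[int, float, int]]:
    """Log (chunk start, state seen, context frames seen) for each chunk."""
    memory = HybridMemory()
    log = []
    for idx in iter_chunks(len(features), chunk_size):
        state, recent = memory.read()   # context available to this chunk
        log.append((idx.start, state, len(recent)))
        memory.write([features[i] for i in idx])
    return log


log = process_stream([float(i) for i in range(10)], chunk_size=4)
```

The point of the sketch is that per-chunk cost depends only on the chunk size and the window length, not on the total sequence length, which is one plausible reading of why training on short sequences could carry over to much longer ones.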

Architecture Design (Transformers, SSMs, MoE) | Computer Vision | Multimodal Models | Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 57

Year: 2026

Venue: N/A
