NVIDIA, TU Munich | Feb 26, 2026 | arXiv:2602.23361

VGG-T$^3$: Offline Feed-Forward 3D Reconstruction at Scale

Sven Elflein, Ruilong Li, Sérgio Agostinho, Zan Gojcic, Laura Leal-Taixé, Qunjie Zhou, Aljosa Osep

AI Summary

The paper introduces VGG-T$^3$, a feed-forward 3D reconstruction model that addresses the quadratic scaling of existing offline methods by distilling the variable-length Key-Value (KV) representation of the scene into a fixed-size MLP via test-time training. The model scales linearly with the number of input views and reconstructs a 1k-image collection in 54 seconds, an 11.6x speedup over softmax-attention baselines. It also achieves lower point map reconstruction error than other linear-time methods and supports visual localization by querying the scene representation with unseen images.
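As a rough way to see where the speedup comes from, assume $N$ input views with $P$ tokens each and feature width $d$. These symbols, the MLP parameter count $C$, and the number of gradient steps $S$ are illustrative assumptions, not notation from the paper. Global softmax attention compares every token with every other token, whereas fitting and querying a fixed-size MLP touches each token a bounded number of times:

$$
\text{cost}_{\text{attention}} = O\big((NP)^2\, d\big),
\qquad
\text{cost}_{\text{MLP}} = O\big(S \cdot NP \cdot C\big)
$$

The first term grows quadratically in $N$ and the second linearly, since $C$ and $S$ do not depend on $N$. If the 11.6x figure refers to the same 1k-image collection, the implied baseline runtime is roughly $54 \times 11.6 \approx 626$ seconds.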

Key Contribution

Ditch quadratic scaling in 3D reconstruction: VGG-T$^3$ achieves linear scaling and an 11.6x speed-up by distilling scene geometry into a fixed-size MLP.

Abstract

We present a scalable 3D reconstruction model that addresses a critical limitation in offline feed-forward methods: their computational and memory requirements grow quadratically w.r.t. the number of input images. Our approach is built on the key insight that this bottleneck stems from the varying-length Key-Value (KV) space representation of scene geometry, which we distill into a fixed-size Multi-Layer Perceptron (MLP) via test-time training. VGG-T$^3$ (Visual Geometry Grounded Test Time Training) scales linearly w.r.t. the number of input views, similar to online models, and reconstructs a $1k$-image collection in just $54$ seconds, achieving an $11.6\times$ speed-up over baselines that rely on softmax attention. Since our method retains global scene aggregation capability, its point map reconstruction error improves on other linear-time methods by large margins. Finally, we demonstrate visual localization capabilities of our model by querying the scene representation with unseen images.
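The idea of distilling a variable-length set of key-value pairs into a fixed-size MLP can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the two-layer ReLU MLP, the mean-squared-error objective, plain full-batch gradient descent, and all names (`init_mlp`, `distill_kv`, `mlp_forward`, `mse`) are assumptions chosen for illustration. The sketch fits the MLP so that `mlp(key)` approximates `value`, then answers queries with a forward pass whose cost does not depend on how many pairs were distilled.

```python
import numpy as np


def init_mlp(d, hidden, rng):
    """Fixed-size two-layer MLP; its size is independent of the token count."""
    return {
        "W1": rng.normal(0.0, 1.0 / np.sqrt(d), (d, hidden)),
        "W2": np.zeros((hidden, d)),
    }


def mlp_forward(params, x):
    """Return the MLP output and the hidden activations for inputs x of shape (n, d)."""
    h = np.maximum(x @ params["W1"], 0.0)  # ReLU hidden layer
    return h @ params["W2"], h


def mse(params, keys, values):
    """Mean squared error between mlp(keys) and values."""
    pred, _ = mlp_forward(params, keys)
    return float(np.mean((pred - values) ** 2))


def distill_kv(params, keys, values, steps=300, lr=0.05):
    """Toy test-time training: fit mlp(key) ~ value by full-batch gradient descent.

    Each step costs O(n * d * hidden), i.e. linear in the number of pairs n.
    """
    n = keys.shape[0]
    for _ in range(steps):
        pred, h = mlp_forward(params, keys)
        err = (pred - values) / n          # gradient of 0.5 * mean row-wise squared error
        grad_W2 = h.T @ err
        grad_h = err @ params["W2"].T
        grad_h[h <= 0.0] = 0.0             # ReLU passes gradient only where it was active
        grad_W1 = keys.T @ grad_h
        params["W1"] -= lr * grad_W1
        params["W2"] -= lr * grad_W2
    return params


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, hidden, n = 8, 32, 512
    keys = rng.normal(size=(n, d))
    # Synthetic "values": a smooth function of the keys (illustrative data only).
    values = np.tanh(keys @ (rng.normal(size=(d, d)) / np.sqrt(d)))

    params = init_mlp(d, hidden, rng)
    print("loss before:", mse(params, keys, values))
    params = distill_kv(params, keys, values)
    print("loss after: ", mse(params, keys, values))

    # Querying the distilled representation is one forward pass per query.
    queries = rng.normal(size=(5, d))
    out, _ = mlp_forward(params, queries)
    print("query output shape:", out.shape)
```

In this sketch the stored state is just `W1` and `W2`, so memory stays constant as more key-value pairs are added, while attention over the raw pairs would need to keep all of them.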

Architecture Design (Transformers, SSMs, MoE) | Computer Vision | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 115

Year: 2026

Venue: N/A
