This paper introduces RegimeVGGT, a training-free method that accelerates the Visual Geometry Grounded Transformer (VGGT) through layer-wise, spatially preserving redundancy removal. By analyzing what each layer contributes to cross-frame attention, the authors identify three distinct regimes and use them to guide a targeted compression strategy. The approach achieves a 6.7x speedup over VGGT* at matched quality for dense 3D scene reconstruction from multi-view images.
A training-free 6.7x speedup at matched reconstruction quality directly addresses the quadratic cross-frame attention cost that limits VGGT's scalability.
Visual Geometry Grounded Transformer (VGGT) recovers dense 3D scene structure from multi-view images in one forward pass, but quadratic cross-frame attention limits its scalability. Existing training-free accelerators reduce computation uniformly along one axis, missing layer heterogeneity. Our spectral, probing, and causal analyses reveal three regimes: shallow layers lack cross-view structure, middle layers drive cross-view alignment, and deep layers are redundant for dense geometry yet their cross-frame attention remains essential for pose. RegimeVGGT applies layer-wise U-shaped compression along two axes: Saliency-Guided Banded Merging protects geometry- and edge-salient tokens, while Selectively Protected K/V Downsampling preserves cross-frame spatial coverage and the pose-critical path through a phase-shifted spatial grid, a reference-frame anchor, and uncompressed camera/register tokens. Training-free, RegimeVGGT achieves a 6.7x speedup over VGGT* at matched reconstruction quality.
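To make the K/V downsampling idea more concrete, here is a minimal sketch of how a phase-shifted spatial grid, a reference-frame anchor, and uncompressed camera/register tokens could combine into one keep-mask over key/value tokens. It is an illustration under stated assumptions, not the authors' implementation: the function name `kv_keep_mask`, the `stride` and `num_special` parameters, the rule that derives each frame's phase from its index, and the layout that places special tokens first in each frame are all hypothetical.

```python
import numpy as np

def kv_keep_mask(num_frames, grid_h, grid_w, stride=2, num_special=5, ref_frame=0):
    """Return a boolean mask of shape (num_frames, num_special + grid_h * grid_w).

    True marks a key/value token that is kept. Assumed layout (hypothetical):
    each frame's tokens are [special tokens..., patch tokens in row-major order].
    """
    mask = np.zeros((num_frames, num_special + grid_h * grid_w), dtype=bool)

    # Camera/register tokens are never compressed.
    mask[:, :num_special] = True

    rows = np.arange(grid_h)[:, None]
    cols = np.arange(grid_w)[None, :]
    for f in range(num_frames):
        # Shift the sampling phase with the frame index (an assumed rule) so
        # that different frames keep different cells of the patch grid.
        phase_r = f % stride
        phase_c = (f // stride) % stride
        keep = ((rows % stride) == phase_r) & ((cols % stride) == phase_c)
        mask[f, num_special:] = keep.ravel()

    # Reference-frame anchor: keep every token of the reference frame.
    mask[ref_frame, :] = True
    return mask

if __name__ == "__main__":
    m = kv_keep_mask(num_frames=5, grid_h=4, grid_w=4)
    print("kept K/V tokens per frame:", m.sum(axis=1))
    print("fraction of K/V tokens kept:", m.mean())
```

With `stride=2`, each non-reference frame keeps roughly a quarter of its patch tokens as keys and values, yet any four consecutive frames jointly cover every grid cell. That is one plausible reading of how such a scheme could preserve cross-frame spatial coverage. In a layer-wise setting, the stride would presumably vary with depth, which this sketch does not model.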