NTU · PKU · UPenn · Jun 16, 2026 · arXiv:2606.18439

RegimeVGGT: Layer-Wise Spatially Preserving Redundancy Removal for Visual Geometry Grounded Transformer

Jinhao You, Shuo Lyu, Zhuohang Lyu, Tanxuan Li, Zibo Zhao, Jiaxiang Hu, Kai Tang, Yichen Guo

AI Summary

This paper introduces RegimeVGGT, a training-free method that accelerates the Visual Geometry Grounded Transformer (VGGT) through layer-wise, spatially preserving redundancy removal. Spectral, probing, and causal analyses of cross-frame attention across layers identify three distinct regimes, which inform a targeted compression strategy. The approach achieves a 6.7x speedup over VGGT* at matched quality for dense 3D scene reconstruction from multi-view images.

Key Contribution

A training-free, layer-wise compression scheme that delivers a 6.7x speedup over VGGT* in dense 3D scene reconstruction at matched reconstruction quality.

Abstract

Visual Geometry Grounded Transformer (VGGT) recovers dense 3D scene structure from multi-view images in one forward pass, but quadratic cross-frame attention limits its scalability. Existing training-free accelerators reduce computation uniformly along one axis, missing layer heterogeneity. Our spectral, probing, and causal analyses reveal three regimes: shallow layers lack cross-view structure, middle layers drive cross-view alignment, and deep layers are redundant for dense geometry yet their cross-frame attention remains essential for pose. RegimeVGGT applies layer-wise U-shaped compression along two axes: Saliency-Guided Banded Merging protects geometry- and edge-salient tokens, while Selectively Protected K/V Downsampling preserves cross-frame spatial coverage and the pose-critical path through a phase-shifted spatial grid, a reference-frame anchor, and uncompressed camera/register tokens. Training-free, RegimeVGGT achieves a 6.7x speedup over VGGT* at matched reconstruction quality.
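The two ideas named in the abstract, a U-shaped layer-wise compression schedule and a phase-shifted spatial grid for K/V downsampling with a reference-frame anchor, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the quadratic schedule shape, the `min_keep` and `stride` parameters, and the way the phase is derived from the frame index are all assumptions made for illustration. Only patch tokens are handled here; camera/register tokens, which the abstract says remain uncompressed, would be kept separately.

```python
import numpy as np


def u_shaped_keep_ratio(num_layers, min_keep=0.25, max_keep=1.0):
    """Hypothetical schedule: fraction of tokens kept per layer.

    Assumes "U-shaped compression" means shallow and deep layers are
    compressed most and middle layers least. The quadratic shape is an
    illustrative choice, not taken from the paper.
    """
    depth = np.linspace(-1.0, 1.0, num_layers)  # -1 shallow, 0 middle, +1 deep
    return max_keep - (max_keep - min_keep) * depth ** 2


def phase_shifted_kv_indices(num_frames, grid_h, grid_w, stride=2, ref_frame=0):
    """Hypothetical K/V patch-token selection on a strided spatial grid.

    Each non-reference frame keeps one token per stride x stride cell, and
    the offset (phase) inside the cell changes with the frame index, so that
    different frames sample different positions. The reference frame keeps
    every patch token (the "anchor").
    """
    selected = []
    for f in range(num_frames):
        if f == ref_frame:
            keep = np.arange(grid_h * grid_w)
        else:
            phase = f % (stride * stride)
            dy, dx = divmod(phase, stride)
            rows = np.arange(dy, grid_h, stride)
            cols = np.arange(dx, grid_w, stride)
            # Flattened row-major indices of the kept grid positions.
            keep = (rows[:, None] * grid_w + cols[None, :]).ravel()
        selected.append(keep)
    return selected


if __name__ == "__main__":
    print(u_shaped_keep_ratio(5))
    for f, idx in enumerate(phase_shifted_kv_indices(5, 4, 4)):
        print(f, idx)
```

With `stride=2` each non-reference frame keeps a quarter of its patch tokens as keys and values, and any four consecutive frames together cover every grid position, which is one plausible reading of "preserves cross-frame spatial coverage".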

Architecture Design (Transformers, SSMs, MoE) · Computer Vision Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
