DAMO, NingboTech University, SJTU, Soul AI Lab, ZJU

May 28, 2026

arXiv:2605.30060

Towards Consistent Video Geometry Estimation

AI Summary

ViGeo, a feed-forward transformer, is introduced for spatially dense and temporally consistent video geometry estimation, supporting streaming, full-sequence, and long-video inference. The core innovation is dynamic chunking attention, enabling the model to adapt its attention pattern at test time by training on both bidirectional and causal temporal contexts. A completion-based data refinement framework further enhances supervision by training a video depth completion teacher to generate dense, coherent training targets from sparse annotations.

Key Contribution

A single feed-forward transformer now achieves state-of-the-art performance across diverse video geometry estimation tasks, rivaling specialized architectures.

Abstract

This work presents ViGeo, a feed-forward foundation model for recovering spatially dense and temporally consistent geometry from video sequences. Built upon a plain transformer architecture without task-specific architectural modifications, ViGeo supports streaming, full-sequence, and long-video inference within a unified model. The key design is dynamic chunking attention, which exposes the model to both bidirectional and causal temporal contexts during training and allows it to adapt its attention pattern at test time without retraining. To improve supervision quality, we further introduce a completion-based data refinement framework. This framework trains a video depth completion teacher that conditions on sparse and noisy annotations and exploits video/multi-view context to produce dense, temporally coherent, and geometrically reliable training targets. Beyond depth and point maps, ViGeo also predicts surface normals within the same framework. Trained solely on public datasets, ViGeo achieves state-of-the-art performance across online, offline, and long-video depth estimation, surface normal estimation, and video point map estimation.
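The abstract does not spell out how dynamic chunking attention is built, so the following is only a minimal sketch of one plausible reading: frames are grouped into chunks, attention is bidirectional inside a chunk and causal across chunks. Under that assumption, a chunk size of 1 gives streaming (fully causal) attention and a chunk size equal to the sequence length gives full-sequence (fully bidirectional) attention. The function names `chunked_frame_mask` and `sample_chunk_size`, and the uniform sampling of chunk sizes during training, are hypothetical and are not taken from the paper.

```python
import random


def chunked_frame_mask(num_frames, chunk_size):
    """Build a frame-level attention mask (illustrative assumption, not ViGeo's code).

    mask[i][j] is True when frame i may attend to frame j. Frame i sees every
    frame in its own chunk (bidirectional) and every frame in earlier chunks
    (causal across chunks).
    """
    if num_frames <= 0 or chunk_size <= 0:
        raise ValueError("num_frames and chunk_size must be positive")
    mask = []
    for i in range(num_frames):
        # Index one past the last frame of the chunk that contains frame i.
        chunk_end = min((i // chunk_size + 1) * chunk_size, num_frames)
        mask.append([j < chunk_end for j in range(num_frames)])
    return mask


def sample_chunk_size(num_frames, rng):
    """Pick a chunk size for one training sample (hypothetical schedule).

    Sampling uniformly from 1..num_frames exposes the model to causal,
    bidirectional, and intermediate temporal contexts.
    """
    return rng.randint(1, num_frames)


if __name__ == "__main__":
    rng = random.Random(0)
    size = sample_chunk_size(6, rng)
    for row in chunked_frame_mask(6, size):
        print("".join("x" if allowed else "." for allowed in row))
```

In a real transformer each frame contributes many tokens, so a frame-level mask like this would be expanded so that all tokens of frame i share the same visibility over the tokens of frame j. At test time, the same weights could then be run with whichever chunk size matches the inference mode.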

Architecture Design (Transformers, SSMs, MoE), Computer Vision, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 94

Year: 2026

Venue: N/A
