BAIR · Apr 9, 2026 · arXiv:2604.08532

Self-Improving 4D Perception via Self-Distillation

Nan Huang, Pengcheng Yu, Weijia Zeng, James M. Rehg, Angjoo Kanazawa, Haiwen Feng

AI Summary

SelfEvo, a self-improving framework, leverages unlabeled videos to enhance pretrained multi-view reconstruction models for 4D perception. It introduces a self-distillation scheme based on spatiotemporal context asymmetry, where the model learns from its own predictions across time and different viewpoints. Experiments across eight benchmarks show SelfEvo consistently improves video depth estimation (up to 36.5%) and camera estimation (up to 20.1%) without ground truth labels, generalizing across different base architectures.

Key Contribution

Up to 36.5% relative improvement in video depth estimation and 20.1% in camera estimation, obtained by letting a pretrained reconstruction model learn from its own predictions on unlabeled videos.

Abstract

Large-scale multi-view reconstruction models have made remarkable progress, but most existing approaches still rely on fully supervised training with ground-truth 3D/4D annotations. Such annotations are expensive and particularly scarce for dynamic scenes, limiting scalability. We propose SelfEvo, a self-improving framework that continually improves pretrained multi-view reconstruction models using unlabeled videos. SelfEvo introduces a self-distillation scheme using spatiotemporal context asymmetry, enabling self-improvement for learning-based 4D perception without external annotations. We systematically study design choices that make self-improvement effective, including loss signals, forms of asymmetry, and other training strategies. Across eight benchmarks spanning diverse datasets and domains, SelfEvo consistently improves pretrained baselines and generalizes across base models (e.g. VGGT and $\pi^3$), with significant gains on dynamic scenes. Overall, SelfEvo achieves up to 36.5% relative improvement in video depth estimation and 20.1% in camera estimation, without using any labeled data. Project Page: https://self-evo.github.io/.
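The abstract does not spell out the training loop, so the following is only a minimal sketch of what self-distillation with context asymmetry could look like, not the authors' implementation. Everything in it is an assumption made for illustration: the scalar toy model (`predict`), the frame-dropping form of asymmetry, the mean-squared-error loss, the EMA teacher update, and all function and parameter names. The teacher sees the full clip, the student sees a random subset of frames, and the student is pulled toward the teacher's predictions on the shared frames.

```python
import random


def predict(weights, frames):
    """Toy stand-in for a reconstruction model (hypothetical).

    Each per-frame output depends on the frame itself and on the mean of
    all frames passed in, so the prediction changes with the context.
    """
    context = sum(frames) / len(frames)
    return [weights["w_frame"] * f + weights["w_ctx"] * context for f in frames]


def distill_step(student, teacher, frames, keep, lr, ema, rng):
    """One self-distillation step with frame-dropping asymmetry (assumed form).

    Returns the mean squared error between student and teacher predictions
    on the frames the student was given.
    """
    # Teacher sees the full clip; its outputs are treated as fixed pseudo-labels.
    targets = predict(teacher, frames)

    # Student sees a reduced context: a random subset of `keep` frames.
    idx = sorted(rng.sample(range(len(frames)), keep))
    sub = [frames[i] for i in idx]
    preds = predict(student, sub)
    ctx = sum(sub) / len(sub)

    # MSE on the shared frames, with the analytic gradient for the toy model.
    n = len(idx)
    loss = g_frame = g_ctx = 0.0
    for p, i, f in zip(preds, idx, sub):
        err = p - targets[i]
        loss += err * err / n
        g_frame += 2.0 * err * f / n
        g_ctx += 2.0 * err * ctx / n

    # Gradient step on the student only.
    student["w_frame"] -= lr * g_frame
    student["w_ctx"] -= lr * g_ctx

    # Teacher slowly tracks the student (exponential moving average).
    for k in teacher:
        teacher[k] = ema * teacher[k] + (1.0 - ema) * student[k]
    return loss
```

In this toy setup, calling `distill_step` with `keep` equal to the clip length and identical student and teacher weights gives a loss of zero, because both models see the same context. A training signal appears only once the student's context is reduced, which is the role the asymmetry plays in the sketch.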

Computer Vision · Inference & Quantization · Multimodal Models · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 79

Year: 2026

Venue: N/A
