Mar 10, 2026 | arXiv:2603.09385

EventVGGT: Exploring Cross-Modal Distillation for Consistent Event-based Depth Estimation

Yinrui Ren, Jinjing Zhu, Kanghao Chen, Zhuoxiao Li, Jing Ou, Zidong Cao, Tongyan Hua, Peilun Shi, Yingchun Fu, Wufan Zhao, Hui Xiong

AI Summary

EventVGGT addresses the challenge of temporally inconsistent depth estimation from event cameras by distilling knowledge from Vision Foundation Models (VFMs). It introduces a tri-level distillation strategy, including Cross-Modal Feature Mixture (CMFM), Spatio-Temporal Feature Distillation (STFD), and Temporal Consistency Distillation (TCD), to leverage spatio-temporal and multi-view geometric priors from the Visual Geometry Grounded Transformer (VGGT). Experiments show EventVGGT significantly outperforms existing methods, reducing the absolute mean depth error at 30m by over 53% on EventScape and demonstrating robust zero-shot generalization.

Key Contribution

EventVGGT improves the temporal consistency and accuracy of event-based depth estimation by distilling spatio-temporal and multi-view geometric priors from a Vision Foundation Model (VGGT), reducing the absolute mean depth error at 30m on EventScape by over 53% (from 2.30 to 1.06).
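The "over 53%" figure can be checked against the two error values quoted in the abstract (2.30 and 1.06). The snippet below is only an arithmetic check; the variable names are illustrative, and it assumes both values are the same metric (absolute mean depth error at 30m on EventScape).

```python
# Values quoted in the abstract: absolute mean depth error at 30m on EventScape.
prior_error = 2.30      # figure reported for existing methods
eventvggt_error = 1.06  # figure reported for EventVGGT

# Relative reduction = (old - new) / old
relative_reduction = (prior_error - eventvggt_error) / prior_error
print(f"relative reduction: {relative_reduction:.1%}")  # prints "relative reduction: 53.9%"
```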

Abstract

Event cameras offer superior sensitivity to high-speed motion and extreme lighting, making event-based monocular depth estimation a promising approach for robust 3D perception in challenging conditions. However, progress is severely hindered by the scarcity of dense depth annotations. While recent annotation-free approaches mitigate this by distilling knowledge from Vision Foundation Models (VFMs), a critical limitation persists: they process event streams as independent frames. By neglecting the inherent temporal continuity of event data, these methods fail to leverage the rich temporal priors encoded in VFMs, ultimately yielding temporally inconsistent and less accurate depth predictions. To address this, we introduce EventVGGT, a novel framework that explicitly models the event stream as a coherent video sequence. To the best of our knowledge, we are the first to distill spatio-temporal and multi-view geometric priors from the Visual Geometry Grounded Transformer (VGGT) into the event domain. We achieve this via a comprehensive tri-level distillation strategy: (i) Cross-Modal Feature Mixture (CMFM) bridges the modality gap at the output level by fusing RGB and event features to generate auxiliary depth predictions; (ii) Spatio-Temporal Feature Distillation (STFD) distills VGGT's powerful spatio-temporal representations at the feature level; and (iii) Temporal Consistency Distillation (TCD) enforces cross-frame coherence at the temporal level by aligning inter-frame depth changes. Extensive experiments demonstrate that EventVGGT consistently outperforms existing methods, reducing the absolute mean depth error at 30m by over 53% on EventScape (from 2.30 to 1.06), while exhibiting robust zero-shot generalization on the unseen DENSE and MVSEC datasets.
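The idea behind TCD, aligning inter-frame depth changes between the event-based student and the VGGT teacher, can be illustrated with a small sketch. The abstract does not give the loss formula, so everything below is an assumption: the L1 penalty, the function names `frame_delta` and `temporal_consistency_loss`, and the toy data are hypothetical and not taken from the paper. Pure Python lists are used for self-containment; a real training loop would use a tensor library.

```python
from typing import List, Sequence

Frame = Sequence[Sequence[float]]  # one H x W depth map


def frame_delta(curr: Frame, prev: Frame) -> List[List[float]]:
    """Per-pixel depth change between two consecutive frames."""
    return [[c - p for c, p in zip(row_c, row_p)]
            for row_c, row_p in zip(curr, prev)]


def temporal_consistency_loss(student: Sequence[Frame],
                              teacher: Sequence[Frame]) -> float:
    """Mean absolute gap between student and teacher inter-frame depth changes.

    Assumed form (L1 on frame-to-frame differences); the paper may differ.
    """
    if len(student) != len(teacher) or len(student) < 2:
        raise ValueError("need two equally long sequences of at least two frames")
    total, count = 0.0, 0
    for t in range(1, len(student)):
        d_student = frame_delta(student[t], student[t - 1])
        d_teacher = frame_delta(teacher[t], teacher[t - 1])
        for row_s, row_t in zip(d_student, d_teacher):
            for a, b in zip(row_s, row_t):
                total += abs(a - b)
                count += 1
    return total / count


# Toy sequences: 3 frames, each a 1 x 2 depth map (made-up values).
teacher = [[[10.0, 20.0]], [[11.0, 20.0]], [[12.0, 21.0]]]
student_offset = [[[12.0, 22.0]], [[13.0, 22.0]], [[14.0, 23.0]]]   # constant +2 bias
student_flicker = [[[10.0, 20.0]], [[13.0, 20.0]], [[12.0, 21.0]]]  # one frame jumps

loss_offset = temporal_consistency_loss(student_offset, teacher)    # 0.0
loss_flicker = temporal_consistency_loss(student_flicker, teacher)  # 1.0
```

In this sketch a constant depth bias yields zero loss while a single flickering frame is penalized: a term of this kind constrains only frame-to-frame changes, which is consistent with the abstract pairing TCD with output-level and feature-level distillation.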

Computer Vision | Inference & Quantization | Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
