Mar 4, 2026 | arXiv:2603.03969

Scaling Dense Event-Stream Pretraining from Visual Foundation Models

Zhiwen Chen, Junhui Hou, Zhiyu Zhu, Jinjian Wu, Guangming Shi

AI Summary

This paper introduces a self-supervised pretraining method that distills knowledge from visual foundation models (VFMs) to improve event representation learning from event streams. The method addresses the sparsity and granularity mismatches between image and event data by extending the alignment objective to semantic structures provided by VFMs. A structure-aware distillation loss is used to optimize dense event representations by grounding higher-quality image-event correspondences. The approach achieves significant improvements in downstream benchmarks compared to traditional and existing pretraining methods, demonstrating enhanced generalization, data efficiency, and transferability.

Key Contribution

By distilling visual foundation models, this work achieves a significant leap in event stream representation learning, surpassing prior methods in generalization, data efficiency, and transferability.

Abstract

Learning versatile, fine-grained representations from irregular event streams is pivotal yet nontrivial, primarily due to the heavy annotation that hinders scalability in dataset size, semantic richness, and application scope. To mitigate this dilemma, we launch a novel self-supervised pretraining method that distills visual foundation models (VFMs) to push the boundaries of event representation at scale. Specifically, we curate an extensive synchronized image-event collection to amplify cross-modal alignment. Nevertheless, due to inherent mismatches in sparsity and granularity between image-event domains, existing distillation paradigms are prone to semantic collapse in event representations, particularly at high resolutions. To bridge this gap, we propose to extend the alignment objective to semantic structures provided off-the-shelf by VFMs, indicating a broader receptive field and stronger supervision. The key ingredient of our method is a structure-aware distillation loss that grounds higher-quality image-event correspondences for alignment, optimizing dense event representations. Extensive experiments demonstrate that our approach takes a great leap in downstream benchmarks, significantly surpassing traditional methods and existing pretraining techniques. This breakthrough manifests in enhanced generalization, superior data efficiency and elevated transferability.
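The abstract does not give the form of the structure-aware distillation loss, so the following is only a minimal sketch of the general idea it describes: aligning event features with image features over semantic regions instead of pixel by pixel. Everything here is an illustrative assumption and not the authors' formulation: the function names, the mean-pooling over regions, the cosine distance, and the use of integer region ids standing in for VFM-provided structures.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def pixel_alignment_loss(event_feats, image_feats):
    """Baseline for contrast: mean cosine distance per pixel."""
    if not event_feats:
        return 0.0
    dists = [1.0 - cosine(e, i) for e, i in zip(event_feats, image_feats)]
    return sum(dists) / len(dists)


def region_means(features, region_ids):
    """Mean-pool feature vectors that share a region id."""
    sums = {}
    counts = defaultdict(int)
    for feat, rid in zip(features, region_ids):
        if rid not in sums:
            sums[rid] = [0.0] * len(feat)
        for k, x in enumerate(feat):
            sums[rid][k] += x
        counts[rid] += 1
    return {rid: [s / counts[rid] for s in vec] for rid, vec in sums.items()}


def region_alignment_loss(event_feats, image_feats, region_ids):
    """Hypothetical structure-level loss: mean cosine distance between
    region-pooled event features and region-pooled image features."""
    ev = region_means(event_feats, region_ids)
    im = region_means(image_feats, region_ids)
    if not ev:
        return 0.0
    dists = [1.0 - cosine(ev[r], im[r]) for r in ev]
    return sum(dists) / len(dists)


# Toy input: 4 pixels, 2 regions. Pixels 1 and 2 have no events,
# so their event features are zero vectors (the sparsity mismatch).
image_feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
event_feats = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 2.0]]
region_ids = [0, 0, 1, 1]

pixel_loss = pixel_alignment_loss(event_feats, image_feats)
region_loss = region_alignment_loss(event_feats, image_feats, region_ids)
```

In this toy input the per-pixel loss penalizes the two empty event pixels, while the region-pooled loss does not, because each region still contains one pixel whose event feature points in the right direction. This only illustrates why structure-level supervision can be more tolerant of sparse event data; the actual loss in the paper may differ substantially.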

Computer Vision | Data Curation & Synthetic Data | Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
