This paper introduces a novel self-supervised learning (SSL) framework for a foveal-inspired vision transformer that iteratively processes multi-zoom patches, which makes the model agnostic to input image size. The approach leverages DINO's self-distillation objective within a sequential-to-global training paradigm, supported by an efficient integral-image patch extraction method. The resulting model achieves competitive performance on ImageNet-1K and downstream tasks while maintaining constant computational cost regardless of input resolution, addressing a key limitation of standard ViTs.
Image-size agnostic vision transformers are now a practical reality, thanks to a new self-supervised pretraining method that maintains constant computational cost regardless of input resolution.
Vision Transformers (ViTs) dominate self-supervised learning (SSL). While they have proven highly effective for large-scale pretraining, they are computationally inefficient and scale poorly with image size. Consequently, foundation models like DINO are constrained to low-resolution processing. A recent foveal-inspired transformer achieves resolution agnosticism by iteratively processing a fixed-size context of multi-zoom patches. This model demonstrated promising results under supervised training, using a sequential, recurrent-like process without backpropagation through time. To unlock its potential as a foundation backbone, we introduce a novel sequential-to-global SSL framework based on DINO's self-distillation objective. Supported by an efficient integral-image patch extraction method, our approach enables large-scale pretraining of image-size agnostic vision encoders. We achieve competitive performance on ImageNet-1K and downstream classification tasks while maintaining a constant computational budget regardless of input resolution.
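The SSL objective itself is DINO's self-distillation: a student network matches the output distribution of an exponential-moving-average teacher whose logits are centered (to prevent collapse) and sharpened with a low temperature. A minimal PyTorch sketch of that objective follows; the temperature values, the centering momentum, and the commented pairing of sequential student passes with a global teacher pass are assumptions for illustration, not details given in this abstract.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_center(center, teacher_logits, momentum=0.9):
    """EMA update of the center applied to teacher logits (as in DINO)."""
    return momentum * center + (1.0 - momentum) * teacher_logits.mean(dim=0)

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the centered, sharpened teacher distribution
    and the student distribution; gradients flow only through the student."""
    teacher_probs = F.softmax(
        (teacher_logits - center) / teacher_temp, dim=-1).detach()
    student_log_probs = F.log_softmax(student_logits / student_temp, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

# Hypothetical pairing for the sequential-to-global scheme: the student
# encodes sequential foveal passes, the teacher encodes a global pass of
# the same image (this split is an assumption, not stated in the abstract):
# loss = dino_loss(student(sequential_views), teacher(global_view), center)
```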
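The constant-cost claim rests on the integral-image (summed-area table) trick: once cumulative sums over the image are precomputed, the average over any axis-aligned box costs four lookups, so a fixed-size patch can be extracted at any zoom level for the same price. Below is a minimal NumPy sketch under stated assumptions: a patch size of p=16, square zoom windows, pixel-space centers, and no boundary handling; the paper's actual extraction scheme is only named, not specified, in this abstract.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero top row and left column,
    so any rectangle sum reduces to four lookups."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1, img.shape[2]))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_mean(ii, top, left, size):
    """Mean over a size x size window in O(1), independent of size."""
    b, r = top + size, left + size
    total = ii[b, r] - ii[top, r] - ii[b, left] + ii[top, left]
    return total / (size * size)

def multizoom_patch(ii, cy, cx, zoom, p=16):
    """Extract a p x p patch centered at (cy, cx) whose cells each average
    a zoom x zoom pixel box, i.e. a (p*zoom)-pixel-wide region at reduced
    detail. Cost is p*p O(1) lookups regardless of zoom or image size.
    Assumes the window lies fully inside the image (no boundary handling)."""
    out = np.empty((p, p, ii.shape[2]))
    top0 = cy - (p * zoom) // 2
    left0 = cx - (p * zoom) // 2
    for i in range(p):
        for j in range(p):
            out[i, j] = box_mean(ii, top0 + i * zoom, left0 + j * zoom, zoom)
    return out

# Usage: one fine foveal patch and one coarse peripheral patch, same cost each.
img = np.random.rand(512, 512, 3)
ii = integral_image(img)
fine = multizoom_patch(ii, 256, 256, zoom=1)    # 16x16-pixel fovea
coarse = multizoom_patch(ii, 256, 256, zoom=8)  # 128x128-pixel periphery
```

Because every patch costs the same handful of lookups per cell, a fixed-size context of such patches keeps the encoder's per-step budget constant whether the input is 224 or 2048 pixels wide, which is consistent with the constant-cost claim above.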