Mar 17, 2026 · arXiv:2603.16840

What DINO saw: ALiBi positional encoding reduces positional bias in Vision Transformers

Moritz Pawlowsky, Antonis Vamvakeros, Alexander Weiss, Anja Bielefeld, Samuel J. Cooper, Ronan Docherty

AI Summary

The authors investigate positional bias in Vision Transformers (ViTs), particularly in feature foundation models like DINOv2, using linear probing across various training objectives and positional encodings. They find that significant positional bias exists, which hinders zero-shot adaptation in domains like materials science, where images lack a preferred direction. To mitigate this, they finetune ViTs with ALiBi relative positional encoding, demonstrating reduced positional bias while preserving general semantic understanding.
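The linear probing idea mentioned above can be illustrated with a small sketch. It assumes the probe target is the patch's row index and that a high R^2 from a linear map signals positional leakage; the paper's actual probing protocol, targets, and metrics may differ. The features are synthetic stand-ins (not DINOv2 outputs), the score is computed in-sample for brevity, and `solve` and `probe_r2` are hypothetical helper names.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def probe_r2(features, targets, ridge=1e-6):
    """Fit a linear probe (weights + bias) by ridge least squares; return in-sample R^2."""
    X = [list(f) + [1.0] for f in features]  # append a constant bias column
    d = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) + (ridge if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    Xty = [sum(x[i] * t for x, t in zip(X, targets)) for i in range(d)]
    w = solve(XtX, Xty)
    preds = [sum(wi * xi for wi, xi in zip(w, x)) for x in X]
    mean_t = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, preds))
    ss_tot = sum((t - mean_t) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in for patch features on an 8x8 grid (assumption: toy data only).
rng = random.Random(0)
rows = [float(r) for r in range(8) for _ in range(8)]  # row index of each patch
# "Biased" features leak the row index into one channel; "unbiased" ones do not.
biased = [[rng.gauss(0, 1), 0.5 * r + 0.01 * rng.gauss(0, 1), rng.gauss(0, 1)]
          for r in rows]
unbiased = [[rng.gauss(0, 1) for _ in range(3)] for _ in rows]

r2_biased = probe_r2(biased, rows)
r2_unbiased = probe_r2(unbiased, rows)
print(f"R^2 (position leaks into features): {r2_biased:.3f}")
print(f"R^2 (position-free features):       {r2_unbiased:.3f}")
```

In practice one would probe real patch embeddings and score on held-out images; the toy data here only shows how a linear probe separates position-dependent features from position-free ones.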

Key Contribution

DINOv2's powerful visual features come with a hidden flaw: strong positional biases that ALiBi positional encoding can effectively mitigate.

Abstract

Vision transformers (ViTs), especially feature foundation models like DINOv2, learn rich representations useful for many downstream tasks. However, architectural choices (such as positional encoding) can lead to these models displaying positional biases and artefacts independent of semantic content. This makes zero-shot adaptation difficult in fields like materials science, where images are often cross-sections of homogeneous microstructure (i.e. having no preferred direction). In this work, we investigate the positional bias in ViTs via linear probing, finding it present across a range of objectives and positional encodings, and subsequently reduce it by finetuning models to use ALiBi relative positional encoding. We demonstrate that these models retain desirable general semantics and that their unbiased features can be used successfully in trainable segmentation of complex microscopy images.
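As a rough illustration of what an ALiBi-style relative bias looks like on a ViT patch grid, the sketch below builds a per-head additive attention bias. The slopes follow the geometric sequence from the original ALiBi recipe (for power-of-two head counts); using the Euclidean distance between patch grid coordinates as the 2D extension is an assumption made here, not something the abstract specifies, and handling of the CLS token is omitted. The function names `alibi_slopes` and `alibi_bias_2d` are hypothetical.

```python
import math

def alibi_slopes(num_heads):
    """Geometric per-head slopes 2^(-8/n), 2^(-16/n), ... as in the original ALiBi recipe."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias_2d(grid_h, grid_w, num_heads):
    """Return bias[h][i][j] = -slope_h * distance(patch i, patch j).

    Assumption: distance is Euclidean in patch-grid units. The bias depends
    only on the offset between patches, never on absolute position.
    """
    coords = [(r, c) for r in range(grid_h) for c in range(grid_w)]
    bias = []
    for slope in alibi_slopes(num_heads):
        head = [[-slope * math.hypot(r1 - r2, c1 - c2) for (r2, c2) in coords]
                for (r1, c1) in coords]
        bias.append(head)
    return bias

# A 3x3 patch grid with 4 attention heads.
bias = alibi_bias_2d(3, 3, num_heads=4)
print(len(bias), len(bias[0]), len(bias[0][0]))
```

Each head's matrix would be added to the pre-softmax attention logits, so attention decays with distance at a head-specific rate. Because the penalty is a function of relative offset alone, two patch pairs with the same offset receive the same bias wherever they sit in the image.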

Architecture Design (Transformers, SSMs, MoE) · Computer Vision · Scientific Discovery & Drug Design

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
