Search papers, labs, and topics across Lattice.
The authors investigate positional bias in Vision Transformers (ViTs), particularly in feature foundation models like DINOv2, using linear probing across various objectives and positional encodings. They find significant positional bias exists, hindering zero-shot adaptation in domains like material science where images lack a preferred direction. To mitigate this, they finetune ViTs with ALiBi relative positional encoding, demonstrating reduced positional bias while preserving general semantic understanding.
DINOv2's powerful visual features come with a hidden flaw: strong positional biases that ALiBi positional encoding can effectively mitigate.
Vision transformers (ViTs) - especially feature foundation models like DINOv2 - learn rich representations useful for many downstream tasks. However, architectural choices (such as positional encoding) can lead to these models displaying positional biases and artefacts independent of semantic content. This makes zero-shot adaption difficult in fields like material science, where images are often cross-sections of homogeneous microstructure (i.e. having no preferred direction). In this work, we investigate the positional bias in ViTs via linear probing, finding it present across a range of objectives and positional encodings, and subsequently reduce it by finetuning models to use ALiBi relative positional encoding. We demonstrate that these models retain desirable general semantics and their unbiased features can be used successfully in trainable segmentation of complex microscopy images.