Friedrich-Alexander-Universität, Imperial, KU Leuven, Leiden, University Hospitals Leuven, University Medical Center Utrecht
Jun 15, 2026
arXiv:2606.16658

Vision-Language Models as Zero-Annotation Oracles in Histopathology

Vishal Jain, Giorgio Buzzanca, Sarah Cechnicka, Maarten Naesens, Priyanka Koshy, Tri Nguyen, Jesper Kers, Candice Roufosse, Bernhard Kainz

AI Summary

This paper introduces a coarse-to-fine approach that uses general-purpose vision-language models (VLMs) as zero-annotation oracles for foreground segmentation in histopathology, addressing the tendency of supervised models to fail on specialized stains. By framing tissue-versus-background discrimination as a natural-image recognition task, the authors show that VLMs trained on internet-scale corpora generalise where domain-specific models do not: on the Leica-75 benchmark the method achieves the highest segmentation quality on out-of-distribution stains (Dice 0.858 on Jones, 0.853 on EVG) with 7x lower cross-stain variance than the best supervised baseline, while remaining competitive on in-distribution H&E. VLM-based annotation review matches human expert consensus, and the resulting pseudo-labels are used to distil lightweight student models that perform as well as the teacher at a fraction of the cost.

Key Contribution

General-purpose VLMs can perform foreground segmentation in histopathology without any manual annotations, outperforming supervised baselines on out-of-distribution stains such as Jones silver and Elastica van Gieson.

Abstract

Foreground segmentation is the critical first step of every computational pathology pipeline, yet existing methods rely on hand-tuned heuristics or supervised models that overfit to narrow stain and scanner distributions, failing silently on specialised stains such as Jones silver or Elastica van Gieson (EVG). We propose a coarse-to-fine approach that recasts foreground segmentation as a visual perception task and leverages general-purpose vision-language models (VLMs) as zero-annotation oracles. Our key insight is that tissue-versus-background discrimination is a natural-image recognition problem, not a histopathological one, so VLMs trained on internet-scale corpora generalise where domain-specific models cannot. We introduce Leica-75, a benchmark of 75 renal transplant whole-slide images spanning three stain families. On Leica-75, our method achieves the highest segmentation quality on out-of-distribution stains (Dice 0.858 ± 0.027 on Jones, 0.853 ± 0.041 on EVG) with 7× lower cross-stain variance than the best supervised baseline, while remaining competitive on in-distribution H&E. Few-shot prompting with automatically curated exemplars (Auto-context) rescues hard cases on Stress-32 (n = 32), a curated stress-test subset (Dice improves from 0.470 to 0.819 for the 2B model). VLM-based annotation review matches human expert consensus (κ = 0.989 for blur detection; mean precision/recall grading accuracy 0.708 vs. human 0.646 for segmentation mask review). The resulting pseudo-labels are used to distil lightweight student models that match the teacher model's performance at a fraction of the cost. Our framework provides a principled, scalable solution to a persistent infrastructure bottleneck in digital pathology.
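The abstract reports results as Dice scores and Cohen's kappa. As a reminder of what those two numbers measure, here is a minimal sketch using their standard textbook definitions. The function names `dice` and `cohen_kappa`, the flat-list mask representation, and the empty-mask convention are assumptions made for illustration; the listing does not describe the authors' evaluation code.

```python
from typing import List


def dice(pred: List[int], target: List[int]) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for flat binary masks (1 = tissue)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


def cohen_kappa(a: List[int], b: List[int]) -> float:
    """Cohen's kappa = (p_o - p_e) / (1 - p_e) for two raters' labels."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(a)
    # Observed agreement: fraction of items both raters labelled the same.
    p_o = sum(1 for x, y in zip(a, b) if x == y) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    p_e = sum((a.count(k) / n) * (b.count(k) / n) for k in set(a) | set(b))
    # If both raters only ever use one identical label, kappa is undefined;
    # this sketch returns 1.0 in that degenerate case.
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 0, 0]
    print(round(dice(predicted, reference), 3))
    print(cohen_kappa([1, 1, 0, 0], [1, 0, 1, 0]))
```

In this toy case the masks share one tissue pixel out of three marked in total, so Dice is 2/3. The two raters in the kappa example agree on half the items, which is exactly what chance would predict, so kappa is 0.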

Computer Vision, Multimodal Models

