Friedrich-Alexander University, Helmholtz, Imperial, Technical University Munich
May 6, 2026
arXiv:2605.05161

Wasserstein-Aligned Localisation for VLM-Based Distributional OOD Detection in Medical Imaging

Bernhard Kainz, Johanna P Mueller, Matthew Baugh, Cosmin Bercea

AI Summary

The paper introduces WALDO, a training-free framework for zero-shot anomaly localization in medical imaging with vision-language models (VLMs). It addresses the lack of healthy anatomical context by recasting localization as a comparative inference problem. WALDO combines entropy-weighted Sliced Wasserstein distances for anatomically aware reference selection, Goldilocks zone sampling to favor references of moderate similarity, and self-consistency aggregation. On the NOVA brain MRI benchmark, WALDO with Qwen2.5-VL-72B achieves 43.5% mAP@30, a 19% relative improvement over zero-shot baselines.

Key Contribution

Counterintuitively, moderately similar reference images are the key to unlocking accurate VLM-based anomaly localization in medical imaging.
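
As a rough illustration of what "moderately similar" could mean in practice, the sketch below keeps only candidate references whose distance to the query falls in a middle quantile band, then samples a few of them. The function name `goldilocks_select`, the band limits `lo_q` and `hi_q`, and the sample size `k` are hypothetical choices made for this example, not values taken from the paper.

```python
import numpy as np

def goldilocks_select(distances, k=3, lo_q=0.3, hi_q=0.7, seed=0):
    """Pick up to k reference indices from a middle band of distances.

    Assumption: the band limits are quantiles of the candidate distances;
    the paper may define its Goldilocks zone differently.
    """
    distances = np.asarray(distances, dtype=float)
    # Band edges: drop the most similar and the least similar candidates
    lo, hi = np.quantile(distances, [lo_q, hi_q])
    band = np.flatnonzero((distances >= lo) & (distances <= hi))
    # Sample without replacement from the candidates inside the band
    rng = np.random.default_rng(seed)
    k = min(k, band.size)
    return np.sort(rng.choice(band, size=k, replace=False))

# Example: 11 candidates with distances 0, 1, ..., 10 to the query
chosen = goldilocks_select(np.arange(11.0), k=3)
print(chosen)
```

The nearest and farthest candidates are never returned, which mirrors the claim that neither near-duplicates nor very dissimilar references are the most useful.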

Abstract

Zero-shot anomaly localisation via vision-language models (VLMs) offers a compelling approach for rare pathology detection, yet its performance is fundamentally limited by the absence of healthy anatomical context. We reformulate zero-shot localisation as a comparative inference problem in which anomalies are identified through structured comparison against reference distributions of normal anatomy. We introduce WALDO, a training-free framework grounded in optimal transport theory that enables comparative reasoning through: (i) entropy-weighted Sliced Wasserstein distances for anatomically-aware reference selection from DINOv2 patch distributions, (ii) Goldilocks zone sampling exploiting the non-monotonic relationship between reference similarity and localisation accuracy, and (iii) self-consistency aggregation via weighted non-maximum suppression. We theoretically analyse the Goldilocks effect through distributional divergence, and show that references with moderate similarity minimize a bias-variance trade-off in comparative visual reasoning. On the NOVA brain MRI benchmark, WALDO with Qwen2.5-VL-72B achieves $43.5_{\pm1.6}\%$ mAP@30 (95% CI: [40.4, 46.7]), representing a 19% relative improvement over zero-shot baselines. Cross-model evaluation shows consistent gains: GPT-4o achieves $32.0_{\pm6.5}\%$ and Qwen3-VL-32B achieves $32.0_{\pm6.6}\%$ mAP@30. Paired McNemar tests confirm statistical significance ($p<0.01$). Source code is available at https://github.com/bkainz/WALDO_MICCAI26_demo .
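
To make component (i) more concrete, here is a minimal NumPy sketch of a weighted Sliced Wasserstein distance between two sets of patch features. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the entropy weights are computed or where they enter, so the helper `entropy_weights` (softmax entropy per patch) and the choice to apply weights per patch are hypothetical, as are the names `weighted_w1_1d` and `sliced_wasserstein` and the default `n_proj=64`.

```python
import numpy as np

def weighted_w1_1d(x, wx, y, wy):
    """Wasserstein-1 distance between two weighted 1-D empirical distributions."""
    ix, iy = np.argsort(x), np.argsort(y)
    x, wx = x[ix], wx[ix] / wx.sum()
    y, wy = y[iy], wy[iy] / wy.sum()
    # W1 equals the integral of |F_x(t) - F_y(t)| over the merged support
    grid = np.sort(np.concatenate([x, y]))
    cdf_x = np.concatenate([[0.0], np.cumsum(wx)])[np.searchsorted(x, grid[:-1], side="right")]
    cdf_y = np.concatenate([[0.0], np.cumsum(wy)])[np.searchsorted(y, grid[:-1], side="right")]
    return float(np.sum(np.abs(cdf_x - cdf_y) * np.diff(grid)))

def entropy_weights(feats):
    """Hypothetical per-patch weight: Shannon entropy of a softmax over feature dims."""
    z = feats - feats.max(axis=1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    # Small offset keeps every weight strictly positive
    return -(p * np.log(p + 1e-12)).sum(axis=1) + 1e-8

def sliced_wasserstein(X, Y, wx=None, wy=None, n_proj=64, seed=0):
    """Average 1-D weighted W1 over random unit projections of the features."""
    rng = np.random.default_rng(seed)
    wx = np.ones(len(X)) if wx is None else np.asarray(wx, dtype=float)
    wy = np.ones(len(Y)) if wy is None else np.asarray(wy, dtype=float)
    dirs = rng.normal(size=(n_proj, X.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    return float(np.mean([weighted_w1_1d(X @ u, wx, Y @ u, wy) for u in dirs]))

# Example with random stand-ins for patch features (e.g. 196 patches, 32 dims)
rng = np.random.default_rng(1)
query = rng.normal(size=(196, 32))
reference = rng.normal(size=(196, 32)) + 0.5
d = sliced_wasserstein(query, reference,
                       entropy_weights(query), entropy_weights(reference))
print(d)
```

Projecting onto random directions reduces the comparison to many 1-D transport problems, each solvable by sorting, which is why the sliced variant is much cheaper than solving a full optimal transport problem over the patch features.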

Computer Vision, Multimodal Models, Scientific Discovery & Drug Design

