Mar 2, 2026 | arXiv:2603.01625

Measuring What VLMs Don't Say: Validation Metrics Hide Clinical Terminology Erasure in Radiology Report Generation

Aditya Parikh, Aasa Feragen, Stella Frank

AI Summary

The paper identifies a critical flaw in how vision-language models (VLMs) are evaluated for radiology report generation: reliance on token-overlap metrics that reward template collapse and mask the erasure of crucial clinical terminology. The authors demonstrate that deterministic decoding strategies achieve high benchmark scores yet cause significant semantic erasure. To address this, they introduce Clinical Association Displacement (CAD), which quantifies demographic-based shifts in word associations, and Weighted Association Erasure (WAE), which aggregates those shifts to measure clinical signal loss.

Key Contribution

VLMs can ace radiology report benchmarks while silently erasing critical clinical details, especially for specific demographic groups.

Abstract

Reliable deployment of Vision-Language Models (VLMs) in radiology requires validation metrics that go beyond surface-level text similarity to ensure clinical fidelity and demographic fairness. This paper investigates a critical blind spot in current model evaluation: the use of decoding strategies that lead to high aggregate token-overlap scores despite succumbing to template collapse, in which models generate only repetitive, safe generic text and omit clinical terminology. Unaddressed, this blind spot can lead to metric gaming, where models that perform well on benchmarks prove clinically uninformative. Instead, we advocate for lexical diversity measures to check model generations for clinical specificity. We introduce Clinical Association Displacement (CAD), a vocabulary-level framework that quantifies shifts in demographic-based word associations in generated reports. Weighted Association Erasure (WAE) aggregates these shifts to measure the clinical signal loss across demographic groups. We show that deterministic decoding produces high levels of semantic erasure, while stochastic sampling generates diverse outputs but risks introducing new bias, motivating a fundamental rethink of how "optimal" reporting is defined.
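The abstract's point about template collapse can be illustrated with a generic lexical diversity check. The sketch below is not the paper's CAD or WAE, whose formulas are not given on this page; it assumes a corpus-level distinct-n ratio and a simple vocabulary-retention ratio as stand-ins. The function names (`distinct_n`, `vocabulary_retention`) and the toy reports are hypothetical and exist only for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(reports, n=2):
    """Corpus-level distinct-n: unique n-grams divided by total n-grams.

    Values near 0 suggest the same phrases are repeated across reports;
    values near 1 suggest varied wording. Whitespace tokenization is an
    assumption made to keep the sketch dependency-free.
    """
    counts = Counter()
    for report in reports:
        counts.update(ngrams(report.lower().split(), n))
    total = sum(counts.values())
    return len(counts) / total if total else 0.0


def vocabulary_retention(references, generations):
    """Fraction of reference word types that also occur in the generations."""
    ref_vocab = {w for r in references for w in r.lower().split()}
    gen_vocab = {w for g in generations for w in g.lower().split()}
    return len(ref_vocab & gen_vocab) / len(ref_vocab) if ref_vocab else 0.0


# Toy, invented reports (not data from the paper).
references = [
    "no acute cardiopulmonary process",
    "small left pleural effusion with atelectasis",
    "right lower lobe consolidation concerning for pneumonia",
]
# A "collapsed" system that emits the same safe template for every image.
collapsed = ["no acute cardiopulmonary process"] * 3
# A system that keeps finding-specific terminology.
varied = [
    "no acute cardiopulmonary process",
    "small left pleural effusion",
    "right lower lobe consolidation",
]

if __name__ == "__main__":
    for name, gens in [("collapsed", collapsed), ("varied", varied)]:
        print(name, distinct_n(gens, n=2), vocabulary_retention(references, gens))
```

On this toy data the collapsed system scores lower on both ratios than the varied one. Checks of this kind flag only surface-level repetition; they say nothing about demographic association shifts, which is the gap the paper's CAD and WAE are described as addressing.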

Eval Frameworks & Benchmarks, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
