Feb 17, 2026

arXiv:2602.16019

MedProbCLIP: Probabilistic Adaptation of Vision-Language Foundation Model for Reliable Radiograph-Report Retrieval

Ahmad Elallaf, Yuktha Priya Masupalli, Jeong Yang, Young Lee, Zechun Cao, Gongbo Liang

AI Summary

MedProbCLIP is introduced as a probabilistic vision-language framework that models chest X-ray and radiology report representations as Gaussian embeddings, using a probabilistic contrastive objective to capture uncertainty and many-to-many correspondences. The framework incorporates a variational information bottleneck to prevent overconfident predictions and employs multi-view radiograph and multi-section report encoding for fine-grained supervision. Experiments on MIMIC-CXR demonstrate that MedProbCLIP outperforms deterministic and probabilistic baselines in retrieval and zero-shot classification, while also exhibiting superior calibration, risk-coverage, selective retrieval reliability, and robustness to clinically relevant corruptions.

Key Contribution

MedProbCLIP makes radiology image-text retrieval more trustworthy: it uses probabilistic embeddings to quantify uncertainty, which improves reliability in high-stakes clinical applications.
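To make the link between quantified uncertainty and reliability concrete, the following is a minimal sketch of a risk-coverage curve for selective retrieval. It is illustrative only and not taken from the paper: the function name `risk_coverage_curve`, the per-query `uncertainty` score (for example, the total variance of a query embedding), and the binary `correct` flags are all assumptions.

```python
import numpy as np

def risk_coverage_curve(uncertainty, correct):
    """Risk-coverage curve for selective retrieval (illustrative sketch).

    uncertainty: hypothetical per-query uncertainty scores (lower = more confident).
    correct:     1 if the top-ranked result for that query was right, else 0.
    Returns (coverage, risk): for each k, the fraction of queries answered when
    only the k most confident are kept, and the error rate among those k.
    """
    uncertainty = np.asarray(uncertainty, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Answer the most confident queries first.
    order = np.argsort(uncertainty, kind="stable")
    errors = 1.0 - correct[order]
    n = len(errors)
    kept = np.arange(1, n + 1)
    coverage = kept / n
    risk = np.cumsum(errors) / kept
    return coverage, risk

if __name__ == "__main__":
    cov, risk = risk_coverage_curve([0.1, 0.9, 0.5, 0.3], [1, 0, 1, 1])
    for c, r in zip(cov, risk):
        print(f"coverage={c:.2f}  risk={r:.2f}")
```

If the uncertainty scores are informative, risk should stay low at small coverage and rise only as less confident queries are admitted; a flat curve would suggest the scores carry little information about correctness.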

Abstract

Vision-language foundation models have emerged as powerful general-purpose representation learners with strong potential for multimodal understanding, but their deterministic embeddings often fail to provide the reliability required for high-stakes biomedical applications. This work introduces MedProbCLIP, a probabilistic vision-language learning framework for chest X-ray and radiology report representation learning and bidirectional retrieval. MedProbCLIP models image and text representations as Gaussian embeddings through a probabilistic contrastive objective that explicitly captures uncertainty and many-to-many correspondences between radiographs and clinical narratives. A variational information bottleneck mitigates overconfident predictions. During training, MedProbCLIP employs multi-view radiograph encoding and multi-section report encoding to provide fine-grained supervision for clinically aligned correspondence, yet it requires only a single radiograph and a single report at inference. Evaluated on the MIMIC-CXR dataset, MedProbCLIP outperforms deterministic and probabilistic baselines, including CLIP, CXR-CLIP, and PCME++, in both retrieval and zero-shot classification. Beyond accuracy, MedProbCLIP demonstrates superior calibration, risk-coverage behavior, selective retrieval reliability, and robustness to clinically relevant corruptions, underscoring the value of probabilistic vision-language modeling for improving the trustworthiness and safety of radiology image-text retrieval systems.
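The abstract does not give the exact form of the objective, so the following is only a sketch of the general ingredients of Gaussian embeddings, under stated assumptions. It assumes diagonal-covariance Gaussians and uses two standard identities: the expected squared distance between independent Gaussians, and the KL divergence to a standard normal that a variational information bottleneck typically penalizes. The function names and the `scale` and `shift` parameters are hypothetical, and whether MedProbCLIP uses these exact forms is an assumption.

```python
import numpy as np

def closed_form_distance(mu_a, var_a, mu_b, var_b):
    """Pairwise expected squared distance between diagonal Gaussians.

    For independent z_a ~ N(mu_a, diag(var_a)) and z_b ~ N(mu_b, diag(var_b)):
        E||z_a - z_b||^2 = ||mu_a - mu_b||^2 + sum(var_a) + sum(var_b)
    Inputs have shapes (n_a, d) and (n_b, d); the result has shape (n_a, n_b).
    """
    mean_term = ((mu_a[:, None, :] - mu_b[None, :, :]) ** 2).sum(-1)
    var_term = var_a.sum(-1)[:, None] + var_b.sum(-1)[None, :]
    return mean_term + var_term

def match_probability(dist, scale=1.0, shift=0.0):
    """Map a distance to a match probability, sigmoid(-scale * dist + shift).

    `scale` and `shift` are hypothetical parameters (learnable in a real model).
    """
    return 1.0 / (1.0 + np.exp(scale * dist - shift))

def kl_to_standard_normal(mu, var):
    """KL( N(mu, diag(var)) || N(0, I) ) per embedding.

    A bottleneck-style regularizer of this form discourages collapsing
    variances, i.e. overconfident embeddings.
    """
    return 0.5 * (var + mu ** 2 - 1.0 - np.log(var)).sum(-1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for encoder outputs: 4 images and 4 reports in 8 dimensions.
    img_mu, txt_mu = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
    img_var = np.exp(rng.normal(size=(4, 8)))  # variances must be positive
    txt_var = np.exp(rng.normal(size=(4, 8)))
    dist = closed_form_distance(img_mu, img_var, txt_mu, txt_var)
    print(match_probability(dist).shape)
    print(kl_to_standard_normal(img_mu, img_var))
```

Because the variance term adds to the distance, an embedding with large variance is never a confident match for anything, which is one way such a model can express uncertainty about ambiguous radiographs or reports.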

Computer Vision, Multimodal Models, Recommendation & Information Retrieval

Year: 2026

Venue: N/A
