Feb 26, 2026 · arXiv:2602.22843

A data- and compute-efficient chest X-ray foundation model beyond aggressive scaling

Chong Wang, Yabin Zhang, Yunhe Gao, Maya Varma, Clemence Mottez, Faidra Patsatzi, Jiaming Liu, Jin Long, Jean-Benoit Delbrouck, Sergios Gatidis, Akshay S. Chaudhari, Curtis P. Langlotz

AI Summary

The authors introduce CheXficient, a chest X-ray (CXR) foundation model, demonstrating that active data curation during pretraining can be a cost-effective alternative to brute-force dataset scaling. CheXficient selectively prioritizes informative training samples, pretraining on only 22.7% of a large CXR dataset while using under 27.3% of the compute. The model achieves comparable or superior performance to models trained on the full dataset and other large-scale pretrained models across 20 benchmarks, particularly improving generalizability on rare conditions.

Key Contribution

Active data curation during pretraining lets you build a chest X-ray foundation model that rivals full-data models using just 22.7% of the data and under 27.3% of the compute.

Abstract

Foundation models for medical imaging are typically pretrained on increasingly large datasets, following a "scale-at-all-costs" paradigm. However, this strategy faces two critical challenges: large-scale medical datasets often contain substantial redundancy and severe class imbalance that bias representation learning toward over-represented patterns, and indiscriminate training regardless of heterogeneity in data quality incurs considerable computational inefficiency. Here we demonstrate that active, principled data curation during pretraining can serve as a viable, cost-effective alternative to brute-force dataset enlargement. We introduce CheXficient, a chest X-ray (CXR) foundation model that selectively prioritizes informative training samples. CheXficient is pretrained on only 22.7% of 1,235,004 paired CXR images and reports while consuming under 27.3% of the total compute budget, yet achieves comparable or superior performance to its full-data counterpart and other large-scale pretrained models. We assess CheXficient across 20 individual benchmarks spanning 5 task types, including non-adapted off-the-shelf evaluations (zero-shot findings classification and crossmodal retrieval) and adapted downstream tasks (disease prediction, semantic segmentation, and radiology report generation). Further analyses show that CheXficient systematically prioritizes under-represented training samples, improving generalizability on long-tailed or rare conditions. Overall, our work offers practical insights into the data and computation demands for efficient pretraining and downstream adaptation of medical vision-language foundation models.
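
To make the data budget concrete, the percentages in the abstract can be converted into approximate sample counts. The sketch below is only back-of-the-envelope arithmetic on the figures quoted above; the variable names are illustrative, and because 22.7% is itself a rounded value, the resulting count is an estimate rather than a number reported by the authors.

```python
# Figures quoted in the abstract.
TOTAL_PAIRS = 1_235_004    # paired CXR images and reports in the full dataset
DATA_FRACTION = 0.227      # share of pairs used for pretraining (rounded in the source)
COMPUTE_FRACTION = 0.273   # upper bound on the share of the compute budget

# Approximate number of image-report pairs actually used for pretraining.
selected_pairs = round(TOTAL_PAIRS * DATA_FRACTION)

# Approximate number of pairs that were not used.
unused_pairs = TOTAL_PAIRS - selected_pairs

# Implied reduction factors relative to full-data pretraining.
data_reduction = 1 / DATA_FRACTION        # roughly 4.4x fewer samples
compute_reduction = 1 / COMPUTE_FRACTION  # at least roughly 3.7x less compute

print(f"Selected pairs (approx.): {selected_pairs:,}")
print(f"Unused pairs (approx.):   {unused_pairs:,}")
print(f"Data reduction:    ~{data_reduction:.1f}x")
print(f"Compute reduction: >{compute_reduction:.1f}x")
```

Under these assumptions, the curated subset comes to roughly 280 thousand pairs out of about 1.2 million.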

Computer Vision · Data Curation & Synthetic Data · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 80

Year: 2026

Venue: N/A
