This study introduces BenchX, a large-scale, open benchmark of 85,355 CT scans designed to evaluate 12 tumor-detection AI models across tumor size, location, patient subgroup, and imaging protocol. The findings indicate that state-of-the-art models, although optimized for high average accuracy, perform poorly at detecting tumors in rare or underrepresented patient subgroups, such as young, female African Americans. The benchmark uses large language models to extract and organize subgroup information from clinical data, which allows it to quantify these discrepancies at scale, and it underscores the need for subgroup-level evaluation in medical imaging AI development.
Current AI models miss tumors in underrepresented demographic subgroups, revealing a hidden bias that could compromise patient outcomes.
Artificial intelligence (AI) has achieved remarkable success in medical imaging, but it is widely recognized that these models often perform inconsistently across real-world clinical settings. Such inconsistencies arise when patient demographics and imaging protocols vary, for example when detecting small tumors, analyzing scans from different contrast phases, or evaluating patients of different ages or sexes. To quantify these inconsistencies, we develop a large-scale, open benchmark of 85,355 CT scans that systematically evaluates 12 tumor-detection AI models across tumor size, location, patient subgroup, and imaging protocol. We leverage large language models (LLMs) to extract and organize subgroup information from clinical data, which makes the analysis both scalable and reproducible. Our benchmark reveals that current state-of-the-art AI models, optimized for average accuracy, perform poorly in rare or underrepresented subgroups, such as young, female African Americans. However, collecting sufficient annotated data for these rare cases is often impractical. The benchmark provides a foundation for building more reliable and robust AI models for tumor detection and highlights the need for rigorous, subgroup-level evaluation in medical imaging and computer vision. Datasets, code
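To illustrate why subgroup-level evaluation matters, the following minimal Python sketch shows how a high average sensitivity can hide poor performance in a small subgroup. It is not the benchmark's code: the record format `(subgroup, has_tumor, detected)`, the function names, and the toy counts are all assumptions made for illustration.

```python
from collections import defaultdict


def sensitivity_by_subgroup(records):
    """Per-subgroup sensitivity (detected tumors / scans with a tumor).

    `records` is a hypothetical list of (subgroup, has_tumor, detected)
    tuples; this format is an assumption, not the benchmark's schema.
    """
    hits = defaultdict(int)
    positives = defaultdict(int)
    for subgroup, has_tumor, detected in records:
        if has_tumor:
            positives[subgroup] += 1
            if detected:
                hits[subgroup] += 1
    # Only subgroups with at least one tumor-positive scan are reported.
    return {group: hits[group] / positives[group] for group in positives}


def overall_sensitivity(records):
    """Sensitivity pooled over all tumor-positive scans."""
    outcomes = [detected for _, has_tumor, detected in records if has_tumor]
    return sum(outcomes) / len(outcomes)


# Toy data (invented): a large subgroup detected well, a small one detected poorly.
records = (
    [("common", True, True)] * 95
    + [("common", True, False)] * 5
    + [("rare", True, True)] * 2
    + [("rare", True, False)] * 3
)

per_group = sensitivity_by_subgroup(records)
pooled = overall_sensitivity(records)
print(f"pooled sensitivity: {pooled:.3f}")
for group, value in sorted(per_group.items()):
    print(f"{group}: {value:.3f}")
```

In this toy example the pooled sensitivity stays above 0.9 while the rare subgroup's sensitivity is 0.4, because the rare subgroup contributes only 5 of the 105 tumor-positive scans. Reporting the metric per subgroup is what exposes the gap.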