Stanford HAI, DKFZ, EPFL, JHU, NTU, UZH | Jun 23, 2026 | arXiv:2606.24883

BenchX: Benchmarking AI Models for Cancer Detection and Localization with Demographic and Protocol Biases

Qi Chen, Wenxuan Li, Pedro R. A. S. Bassi, Xinze Zhou, Jakob Wasserthal, Ibrahim Ethem Hamamci, Sezgin Er, Ashwin Kumar, Yiwen Ye, Yuhan Wang, Yuyin Zhou, Akshay S. Chaudhari, Curtis Langlotz, Kang Wang, Yang Yang, Alan L. Yuille, Zongwei Zhou

AI Summary

This study introduces BenchX, a comprehensive benchmark consisting of 85,355 CT scans designed to evaluate the performance of 12 tumor-detection AI models across various demographic and imaging protocol biases. The findings indicate that state-of-the-art models, while achieving high average accuracy, significantly underperform in detecting tumors in rare or underrepresented patient subgroups, such as young, female African Americans. By leveraging large language models to extract and organize clinical data, this benchmark not only quantifies these discrepancies but also emphasizes the necessity for subgroup-level evaluations in medical imaging AI development.

Key Contribution

Current AI models miss tumors in underrepresented demographic subgroups, exposing a bias that average accuracy hides and that could compromise patient outcomes.

Abstract

Artificial intelligence (AI) has achieved remarkable success in medical imaging, but it is widely recognized that these models often perform inconsistently across real-world clinical settings. Such inconsistencies occur when patient demographics and imaging protocols vary, for example, in detecting small tumors, analyzing scans from different contrast phases, or evaluating patients of different ages or sexes. To quantify these inconsistencies, we develop a large-scale, open benchmark of 85,355 CT scans that systematically evaluates 12 tumor-detection AI models across tumor size, location, patient subgroup, and imaging protocol. We leverage large language models (LLMs) to extract and organize subgroup information from clinical data, which makes the analysis both scalable and reproducible. Our benchmark reveals that current state-of-the-art AI models, optimized for average accuracy, perform poorly in rare or underrepresented subgroups, such as young, female African Americans. However, collecting sufficient annotated data for these rare cases is often impractical. The benchmark provides a foundation for building more reliable and robust AI models for tumor detection and highlights the need for rigorous, subgroup-level evaluation in medical imaging and computer vision. Datasets, code
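To illustrate what subgroup-level evaluation means in practice, the following is a minimal sketch, not the benchmark's own code. It computes detection sensitivity separately for each subgroup and compares it with the pooled figure. The record fields (`sex`, `age_band`, `has_tumor`, `detected`) and the toy records are hypothetical and are not drawn from BenchX.

```python
from collections import defaultdict


def subgroup_sensitivity(records, keys):
    """Return {subgroup: (sensitivity, n_positive)} over tumor-positive scans.

    `records` is a list of dicts; `keys` names the fields that define a
    subgroup. All field names here are hypothetical, chosen for illustration.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        if not r["has_tumor"]:
            continue  # sensitivity only counts tumor-positive scans
        group = tuple(r[k] for k in keys)
        totals[group] += 1
        hits[group] += int(r["detected"])
    return {g: (hits[g] / totals[g], totals[g]) for g in totals}


# Toy data (made up): a small subgroup with missed tumors next to a larger one.
records = [
    {"sex": "F", "age_band": "<40", "has_tumor": True, "detected": False},
    {"sex": "F", "age_band": "<40", "has_tumor": True, "detected": True},
    {"sex": "M", "age_band": "60+", "has_tumor": True, "detected": True},
    {"sex": "M", "age_band": "60+", "has_tumor": True, "detected": True},
    {"sex": "M", "age_band": "60+", "has_tumor": True, "detected": True},
    {"sex": "M", "age_band": "60+", "has_tumor": False, "detected": False},
]

per_group = subgroup_sensitivity(records, keys=("sex", "age_band"))
overall = subgroup_sensitivity(records, keys=())  # empty key pools all scans

# Report the pooled figure, then each subgroup from worst to best.
print("overall:", overall[()])
for group, (sens, n) in sorted(per_group.items(), key=lambda kv: kv[1][0]):
    print(group, f"sensitivity={sens:.2f}", f"n={n}")
```

In this toy example the pooled sensitivity looks acceptable while the smaller subgroup lags well behind it, which is the kind of gap an average-only report would hide. Reporting the subgroup size alongside each estimate matters because rare subgroups yield noisy estimates.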

Computer Vision | Constitutional AI & AI Ethics | Eval Frameworks & Benchmarks

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
