This paper introduces a fairness evaluation framework for automated prior authorization (PA) systems that focuses on model error rates instead of approval rates, acknowledging legitimate clinical guideline differences across demographic groups. The authors evaluated a PA model using 7,166 human-reviewed cases across 27 guidelines, assessing consistency in error rates across sex, age, race/ethnicity, and socioeconomic status using error-rate comparisons, tolerance bands, statistical power evaluation, and logistic regression. The results showed consistent error rates across most demographics, but inconclusive evidence for race/ethnicity due to limited subgroup sample sizes.
You can't use naive parity metrics for fairness in healthcare AI: this framework uses error rates to account for legitimate clinical differences across demographic groups.
Growing staffing constraints and turnaround-time pressures in prior authorization (PA) have driven increasing automation of the decision systems that support PA review. Evaluating fairness in such systems poses unique challenges because legitimate clinical guidelines and medical necessity criteria often differ across demographic groups, making parity in approval rates an inappropriate fairness metric. We propose a fairness evaluation framework for prior authorization models based on model error rates rather than approval outcomes. Using 7,166 human-reviewed cases spanning 27 medical necessity guidelines, we assessed the consistency of model error rates across sex, age, race/ethnicity, and socioeconomic status. Our evaluation combined error-rate comparisons, tolerance-band analysis with a predefined $\pm$5 percentage-point margin, statistical power evaluation, and protocol-controlled logistic regression. Across most demographics, model error rates were consistent, and confidence intervals fell within the predefined tolerance band, indicating no meaningful performance differences. For race/ethnicity, point estimates remained small, but subgroup sample sizes were limited, resulting in wide confidence intervals and underpowered tests, so the evidence within the dataset we explored is inconclusive. These findings illustrate a rigorous and regulator-aligned approach to fairness evaluation in administrative healthcare AI systems.
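The tolerance-band idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how the confidence intervals were computed, so this example assumes a 95% Wald (normal-approximation) interval on the difference in error rates between two groups. The function names, the three-way verdict labels, and all counts are hypothetical and are not taken from the paper. Only the $\pm$5 percentage-point margin comes from the source.

```python
import math


def error_rate_diff_ci(errors_a, n_a, errors_b, n_b, z=1.96):
    """Difference in error rates (group A minus group B) with a Wald interval.

    Assumption: a normal-approximation interval; the paper may use another method.
    """
    p_a = errors_a / n_a
    p_b = errors_b / n_b
    diff = p_a - p_b
    # Standard error of the difference of two independent proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff, diff - z * se, diff + z * se


def tolerance_verdict(lower, upper, margin=0.05):
    """Classify an interval against a symmetric tolerance band of +/- margin."""
    if -margin <= lower and upper <= margin:
        return "within tolerance"   # whole interval sits inside the band
    if lower > margin or upper < -margin:
        return "outside tolerance"  # whole interval sits outside the band
    return "inconclusive"           # interval straddles a band edge


# Hypothetical counts: two large groups with similar error rates.
_, lo_large, hi_large = error_rate_diff_ci(100, 2000, 110, 2000)
verdict_large = tolerance_verdict(lo_large, hi_large)

# Hypothetical counts: a small subgroup widens the interval past the band.
_, lo_small, hi_small = error_rate_diff_ci(100, 2000, 6, 100)
verdict_small = tolerance_verdict(lo_small, hi_small)

print(verdict_large)
print(verdict_small)
```

With these illustrative counts, the small subgroup has a point estimate close to zero but an interval that crosses the edge of the band. This is the pattern the abstract describes for race/ethnicity: small differences that limited sample sizes cannot confirm as within tolerance.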