DeepNeuro AI | May 5, 2026 | arXiv:2605.03998

EQUITRIAGE: A Fairness Audit of Gender Bias in LLM-Based Emergency Department Triage

AI Summary

The paper introduces EQUITRIAGE, a fairness audit of Emergency Severity Index (ESI) assignment by five LLMs (Gemini-3-Flash, Nemotron-3-Super, DeepSeek-V3.1, Mistral-Small-3.2, GPT-4.1-Nano), covering 374,275 evaluations on MIMIC-IV-ED vignettes, including gender-swapped counterfactuals. All five models exceeded the pre-registered 5% flip-rate threshold when patient gender was swapped (9.9% to 43.8%), and two showed directional female undertriage. In DeepSeek this bias coexisted with a low calibration gap, showing that within-group calibration and counterfactual invariance can dissociate. Interventions had model-dependent effects: demographic blinding reduced Gemini's flip rate to 0.5%, an age-preserving blind variant left DeepSeek with a residual F/M ratio of 1.25, and chain-of-thought prompting degraded accuracy for all five models. A two-model ablation indicates that the mechanisms behind the same directional bias can differ across models.

Key Contribution

LLMs can exhibit gender bias in emergency triage even when well-calibrated, and interventions effective for one model may backfire on another.

Abstract

Emergency department triage assigns patients an acuity score that determines treatment priority, and clinical evidence documents persistent gender disparities in human acuity assessment. As hospitals pilot large language models (LLMs) as triage decision support, a critical question is whether these models reproduce or mitigate known biases. We present EQUITRIAGE, a fairness audit of LLM-based ESI assignment evaluating five models (Gemini-3-Flash, Nemotron-3-Super, DeepSeek-V3.1, Mistral-Small-3.2, GPT-4.1-Nano) across 374,275 evaluations on 18,714 MIMIC-IV-ED vignettes under four prompt strategies. Of 9,368 originals, 9,346 are paired with a gender-swapped counterfactual. All five models produced flip rates above a pre-registered 5% threshold (9.9% to 43.8%). Two showed directional female undertriage (DeepSeek F/M 2.15:1, Gemini 1.34:1); two were near-parity; one had high sensitivity with weak male-direction asymmetry. DeepSeek's directional bias coexisted with a low outcome-linked calibration gap (0.013 against MIMIC-IV admission), a Chouldechova-style dissociation between within-group calibration and between-pair counterfactual invariance. Demographic blinding reduced Gemini's flip rate to 0.5%; an age-preserving blind variant left DeepSeek with residual F/M 1.25, implicating age as a residual channel. Chain-of-thought prompting degraded accuracy for all five models. A two-model ablation reveals opposite underlying mechanisms for the same directional phenotype: in Gemini the signal is emergent in the combined name+gender swap, while in DeepSeek the gender token alone carries it. EQUITRIAGE shows that group parity, counterfactual invariance, and gender calibration are distinct fairness properties, that intervention effectiveness is model-dependent, and that per-model counterfactual auditing should precede clinical deployment.
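The two headline metrics in the abstract, flip rate and the F/M directional ratio, can be illustrated with a minimal sketch. The abstract does not give the exact definitions, so the following are assumptions: a "flip" is any pair whose original and gender-swapped vignettes receive different ESI levels, and the F/M ratio is the number of pairs where the female version is undertriaged (higher ESI number, i.e. lower acuity) divided by the number where the male version is. The names `PairedResult`, `flip_rate`, and `directional_ratio` are hypothetical, and the toy data is invented for illustration, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PairedResult:
    """ESI levels for one counterfactual pair (1 = most urgent, 5 = least urgent)."""
    esi_female: int  # ESI assigned to the female version of the vignette
    esi_male: int    # ESI assigned to the male version of the vignette


def flip_rate(pairs: Sequence[PairedResult]) -> float:
    """Fraction of pairs whose ESI changes when gender is swapped (assumed definition)."""
    if not pairs:
        raise ValueError("need at least one pair")
    flips = sum(1 for p in pairs if p.esi_female != p.esi_male)
    return flips / len(pairs)


def directional_ratio(pairs: Sequence[PairedResult]) -> float:
    """Female-undertriage flips divided by male-undertriage flips (assumed definition).

    A higher ESI number means lower assigned acuity, so esi_female > esi_male
    counts as female undertriage relative to the male counterfactual.
    """
    female_under = sum(1 for p in pairs if p.esi_female > p.esi_male)
    male_under = sum(1 for p in pairs if p.esi_male > p.esi_female)
    if male_under == 0:
        # Ratio is undefined without flips; unbounded if only female undertriage occurs.
        return float("inf") if female_under else float("nan")
    return female_under / male_under


# Toy data (illustrative only): 10 pairs, 4 flips, 3 of them female undertriage.
pairs = [
    PairedResult(3, 3), PairedResult(4, 3), PairedResult(3, 2), PairedResult(2, 3),
    PairedResult(2, 2), PairedResult(3, 3), PairedResult(5, 4), PairedResult(1, 1),
    PairedResult(4, 4), PairedResult(2, 2),
]

THRESHOLD = 0.05  # the pre-registered 5% flip-rate threshold cited in the abstract

rate = flip_rate(pairs)
ratio = directional_ratio(pairs)
print(f"flip rate = {rate:.1%}, exceeds threshold: {rate > THRESHOLD}")
print(f"F/M undertriage ratio = {ratio:.2f}:1")
```

Under these assumed definitions, a ratio near 1 with a high flip rate corresponds to the "near-parity" models the abstract describes (sensitive to gender but without a consistent direction), while a ratio well above 1 corresponds to directional female undertriage.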

Constitutional AI & AI Ethics | Eval Frameworks & Benchmarks | Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 62

Year: 2026

Venue: N/A
