The paper introduces ICE-Guard, a framework that uses intervention consistency testing to detect spurious feature reliance in LLMs across demographic, authority, and framing dimensions. Evaluating 11 LLMs across 10 high-stakes domains, the authors find that authority and framing biases are more prevalent than demographic biases, with substantial variance across domains. Structured decomposition markedly reduces these biases, and an iterative prompt-patching loop guided by ICE-Guard achieves further substantial bias reduction.
LLMs are far more susceptible to authority and framing biases than the field's obsession with demographic bias suggests.
Large language models (LLMs) are increasingly used for high-stakes decisions, yet their susceptibility to spurious features remains poorly characterized. We introduce ICE-Guard, a framework applying intervention consistency testing to detect three types of spurious feature reliance: demographic (name/race swaps), authority (credential/prestige swaps), and framing (positive/negative restatements). Across 3,000 vignettes spanning 10 high-stakes domains, we evaluate 11 LLMs from 8 families and find that (1) authority bias (mean 5.8%) and framing bias (5.0%) substantially exceed demographic bias (2.2%), challenging the field's narrow focus on demographics; (2) bias concentrates in specific domains -- finance shows 22.6% authority bias while criminal justice shows only 2.8%; (3) structured decomposition, where the LLM extracts features and a deterministic rubric decides, reduces flip rates by up to 100% (median 49% across 9 models). We demonstrate an ICE-guided detect-diagnose-mitigate-verify loop achieving cumulative 78% bias reduction via iterative prompt patching. Validation against real COMPAS recidivism data shows COMPAS-derived flip rates exceed pooled synthetic rates, suggesting our benchmark provides a conservative estimate of real-world bias. Code and data are publicly available.
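The core metric behind intervention consistency testing is a flip rate: the fraction of paired vignettes whose decision changes after a swap (name, credential, or framing) that should be decision-irrelevant. A minimal sketch, assuming paired decision lists; the function and variable names here are illustrative, not the authors' code:

```python
# Illustrative sketch of intervention consistency testing (not the
# paper's implementation). A "flip" is any case where the model's
# decision differs between the original vignette and its intervened
# counterpart (e.g., after an authority/credential swap).
def flip_rate(original_decisions, intervened_decisions):
    """Fraction of paired cases whose decision flips after a
    decision-irrelevant intervention."""
    if len(original_decisions) != len(intervened_decisions):
        raise ValueError("decision lists must be paired")
    flips = sum(a != b for a, b in zip(original_decisions, intervened_decisions))
    return flips / len(original_decisions)

# Hypothetical example: decisions on five vignettes before and after
# an authority swap; two of five flip.
before = ["approve", "deny", "approve", "approve", "deny"]
after_swap = ["approve", "approve", "approve", "deny", "deny"]
print(flip_rate(before, after_swap))  # → 0.4
```

A per-domain, per-dimension flip rate computed this way is what makes figures like "22.6% authority bias in finance vs. 2.8% in criminal justice" comparable across models.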