This paper benchmarks the fairness of speech recognition systems using different decoder architectures (CTC, encoder-decoder, and LLM-based) across five demographic axes. The study finds that LLM decoders do not necessarily amplify racial bias, that Whisper exhibits pathological hallucination on Indian-accented speech, and that the audio encoder's compression ratio is a better predictor of accent fairness than LLM scale. Stress tests with acoustic degradation reveal that severe degradation narrows fairness gaps, while silence injection amplifies Whisper's accent bias, suggesting audio encoder design is key to equitable and robust speech recognition.
Counterintuitively, scaling up LLM decoders in speech recognition doesn't guarantee fairness; audio encoder design matters more, as Whisper's pathological hallucinations on Indian-accented speech and repetition loops under masking demonstrate.
As pretrained large language models replace task-specific decoders in speech recognition, a critical question arises: do their text-derived priors make recognition fairer or more biased across demographic groups? We evaluate nine models spanning three architectural generations (CTC with no language model, encoder-decoder with an implicit LM, and LLM-based with an explicit pretrained decoder) on about 43,000 utterances across five demographic axes (ethnicity, accent, gender, age, first language) using Common Voice 24 and Meta's Fair-Speech, a controlled-prompt dataset that eliminates vocabulary confounds. On clean audio, three findings challenge common assumptions: LLM decoders do not amplify racial bias (Granite-8B has the best ethnicity fairness, max/min WER = 2.28); Whisper exhibits pathological hallucination on Indian-accented speech, with a non-monotonic insertion-rate spike to 9.62% at large-v3; and the audio encoder's compression ratio predicts accent fairness better than LLM scale. We then stress-test these findings under 12 acoustic degradation conditions (noise, reverberation, silence injection, chunk masking) across both datasets, totaling 216 inference runs. Severe degradation paradoxically narrows fairness gaps as all groups converge to high WER, but silence injection amplifies Whisper's accent bias by up to 4.64x by triggering demographic-selective hallucination. Under masking, Whisper enters catastrophic repetition loops (86% of 51,797 insertions), while explicit-LLM decoders produce 38x fewer insertions with near-zero repetition; high-compression audio encoding (Q-former) reintroduces the repetition pathology even in LLM decoders. These results suggest that audio encoder design, not LLM scaling, is the primary lever for equitable and robust speech recognition.
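The fairness figures quoted above (e.g. max/min WER = 2.28) are ratios of per-group word error rates. A minimal sketch of how such a gap could be computed, assuming hypothetical group names and WER values (the paper does not publish this exact code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Standard word error rate via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def fairness_gap(group_wers: dict) -> float:
    """Max/min WER ratio across demographic groups; 1.0 = equal error rates."""
    vals = list(group_wers.values())
    return max(vals) / min(vals)

# Illustrative values only, not results from the paper:
example = {"group_a": 0.10, "group_b": 0.15, "group_c": 0.20}
gap = fairness_gap(example)  # 0.20 / 0.10 = 2.0
```

A ratio near 1.0 means error rates are balanced across groups; the paper's stress tests show this ratio can shrink under severe degradation simply because every group's WER rises toward the ceiling.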