This paper investigates gender bias in Mean Opinion Score (MOS) evaluations of speech quality, finding that male listeners consistently rate speech higher than female listeners, especially for low-quality speech. The authors show that standard MOS prediction models trained on aggregated scores inherit this bias, skewing predictions towards male perceptions. To mitigate this, they propose a gender-aware MOS model that learns separate gender-specific scoring patterns using binary group embeddings, leading to improved prediction accuracy for both genders.
Speech quality assessment is skewed: male listeners consistently give higher scores than female listeners, and standard MOS models learn and perpetuate this bias.
The Mean Opinion Score (MOS) serves as the standard metric for speech quality assessment, yet biases in human annotations remain underexplored. We conduct the first systematic analysis of gender bias in MOS, revealing that male listeners consistently assign higher scores than female listeners, a gap that is most pronounced for low-quality speech and gradually diminishes as quality improves. This quality-dependent structure proves difficult to eliminate through simple calibration. We further demonstrate that automated MOS models trained on aggregated labels produce predictions skewed toward male standards of perception. To address this, we propose a gender-aware model that learns gender-specific scoring patterns through binary group embeddings, improving both overall and gender-specific prediction accuracy. This study establishes that gender bias in MOS constitutes a systematic, learnable pattern that demands attention in equitable speech evaluation.
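The conditioning idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name, feature dimension, embedding size, and the linear regression head are all illustrative assumptions. The only point it demonstrates is the stated mechanism, concatenating a learned binary group embedding (listener gender) onto the speech features before predicting a MOS value on the 1-5 scale.

```python
import numpy as np

rng = np.random.default_rng(0)

class GenderAwareMOS:
    """Illustrative sketch (not the paper's model): a linear MOS regressor
    conditioned on a binary listener-group embedding, so the same utterance
    can receive different predicted scores per listener group."""

    def __init__(self, feat_dim: int, emb_dim: int = 4):
        # One learned embedding row per binary group (hypothetical sizes).
        self.group_emb = rng.normal(scale=0.1, size=(2, emb_dim))
        self.w = rng.normal(scale=0.1, size=feat_dim + emb_dim)
        self.b = 3.0  # bias near the middle of the MOS scale

    def predict(self, feats: np.ndarray, group_id: int) -> float:
        # Condition on the group by concatenating its embedding to the features.
        x = np.concatenate([feats, self.group_emb[group_id]])
        # MOS is bounded on [1, 5], so clip the linear output.
        return float(np.clip(x @ self.w + self.b, 1.0, 5.0))

model = GenderAwareMOS(feat_dim=8)
utterance_feats = rng.normal(size=8)  # stand-in for speech features
score_group0 = model.predict(utterance_feats, 0)
score_group1 = model.predict(utterance_feats, 1)
print(score_group0, score_group1)
```

Because the two groups have distinct embeddings, the model can represent the quality-dependent rating gap the paper reports, rather than averaging it away as a model trained on aggregated labels would.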