The paper introduces CodecMOS-Accent, a new MOS benchmark for evaluating neural audio codecs and LLM-based TTS models, with a focus on accented speech. The dataset includes 4,000 resynthesis and TTS samples from 24 systems, covering 32 speakers across ten accents, and is annotated with 19,600 subjective ratings for naturalness, speaker similarity, and accent similarity. Analysis of the benchmark reveals relationships between speaker and accent similarity, the predictive power of objective metrics, and accent-based perceptual biases in listeners.
Accented speech reveals perceptual biases in speech synthesis evaluation: listeners rate speakers with matching accents as more natural.
We present the CodecMOS-Accent dataset, a mean opinion score (MOS) benchmark designed to evaluate neural audio codec (NAC) models and the large language model (LLM)-based text-to-speech (TTS) models trained upon them, particularly on non-standard speech such as accented speech. The dataset comprises 4,000 codec resynthesis and TTS samples from 24 systems, featuring 32 speakers spanning ten accents. A large-scale subjective test was conducted to collect 19,600 annotations from 25 listeners across three dimensions: naturalness, speaker similarity, and accent similarity. This dataset not only represents an up-to-date study of recent speech synthesis system performance but also reveals insights including a tight relationship between speaker and accent similarity, the predictive power of objective metrics, and a perceptual bias when listeners share the speaker's accent. This dataset is expected to foster research on more human-centric evaluation for NAC and accented TTS.
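As a rough illustration of how listener annotations like these are typically turned into a per-system MOS, the sketch below averages ratings by system. It is a minimal sketch, not the authors' pipeline: the `(system_id, score)` record layout, the 1-5 rating scale, and the system names are all assumptions made for illustration, and the example ratings are made up rather than taken from the dataset.

```python
from collections import defaultdict
from statistics import mean


def mos_by_system(ratings):
    """Average listener scores per system.

    ratings: iterable of (system_id, score) pairs. The pair layout and the
    1-5 scale are assumptions for this sketch, not the dataset's schema.
    """
    by_system = defaultdict(list)
    for system_id, score in ratings:
        by_system[system_id].append(score)
    # The MOS of a system is the arithmetic mean of its collected ratings.
    return {system_id: mean(scores) for system_id, scores in by_system.items()}


# Hypothetical ratings for two made-up systems (not real data).
example = [
    ("codec_a", 4), ("codec_a", 5),
    ("tts_b", 3), ("tts_b", 2), ("tts_b", 4),
]
scores = mos_by_system(example)
```

The same aggregation could be run separately for each rated dimension (naturalness, speaker similarity, accent similarity) by keeping one list of ratings per dimension.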