This paper introduces a controlled pairwise evaluation framework for multilingual TTS across 10 Indic languages, addressing the high variance inherent in speech perception. The authors collected over 120K pairwise comparisons from native raters, evaluating 7 state-of-the-art TTS systems across six perceptual dimensions. The study uses Bradley-Terry modeling to generate a multilingual leaderboard and SHAP analysis to interpret human preferences, revealing model strengths and trade-offs.
Forget English – this study reveals which TTS systems truly resonate with native speakers across ten diverse Indian languages, pinpointing specific perceptual dimensions that drive preference.
Crowdsourced pairwise evaluation has emerged as a scalable approach for assessing foundation models. However, applying it to Text-to-Speech (TTS) introduces high variance due to linguistic diversity and the multidimensional nature of speech perception. We present a controlled multidimensional pairwise evaluation framework for multilingual TTS that combines linguistic control with perceptually grounded annotation. Using 5K+ native and code-mixed sentences across 10 Indic languages, we evaluate 7 state-of-the-art TTS systems and collect over 120K pairwise comparisons from over 1900 native raters. In addition to overall preference, raters provide judgments across 6 perceptual dimensions: intelligibility, expressiveness, voice quality, liveliness, noise, and hallucinations. Using Bradley-Terry modeling, we construct a multilingual leaderboard, interpret human preference using SHAP analysis, and analyze leaderboard reliability alongside model strengths and trade-offs across perceptual dimensions.
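For readers unfamiliar with Bradley-Terry modeling, the following is a minimal sketch of how pairwise preference counts can be turned into a ranking. It is not the authors' implementation: the function name `fit_bradley_terry`, the iterative minorization-maximization update, and the toy win counts are all assumptions chosen for illustration, and the paper's handling of ties, languages, and perceptual dimensions is not modeled here.

```python
import numpy as np

def fit_bradley_terry(wins, n_iter=200, tol=1e-10):
    """Estimate Bradley-Terry strengths from a matrix of pairwise wins.

    wins[i, j] is the number of times system i was preferred over system j.
    Assumes every system has at least one win and one loss, so that the
    maximum-likelihood estimate is finite and non-zero.
    """
    wins = np.asarray(wins, dtype=float)
    n = wins.shape[0]
    games = wins + wins.T            # total comparisons for each pair
    total_wins = wins.sum(axis=1)    # total wins per system
    p = np.full(n, 1.0 / n)          # start from equal strengths

    for _ in range(n_iter):
        # MM update: p_i <- W_i / sum_j [ n_ij / (p_i + p_j) ]
        denom = games / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)
        new_p = total_wins / denom.sum(axis=1)
        new_p /= new_p.sum()         # fix the scale: strengths sum to 1
        converged = np.max(np.abs(new_p - p)) < tol
        p = new_p
        if converged:
            break
    return p

# Hypothetical toy data for three systems (not taken from the paper).
toy_wins = [
    [0, 8, 9],   # system 0 preferred over system 1 eight times, over system 2 nine times
    [2, 0, 7],
    [1, 3, 0],
]
strengths = fit_bradley_terry(toy_wins)
ranking = np.argsort(-strengths)     # indices of systems, strongest first
```

Under this model, the probability that system i is preferred over system j is `strengths[i] / (strengths[i] + strengths[j])`, so sorting by strength yields a leaderboard.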