This paper introduces AnimeScore, a preference-based dataset and framework for automatically evaluating the anime-likeness of speech. It addresses the unreliability of Mean Opinion Score (MOS) protocols for this subjective quality, which lacks a shared absolute scale. The authors collected 15,000 pairwise preference judgments and found that anime-likeness correlates with controlled resonance shaping, prosodic continuity, and deliberate articulation. They demonstrate that SSL-based ranking models trained on AnimeScore achieve up to 90.8% AUC, outperforming handcrafted acoustic features and enabling use of the metric as a reward signal for generative speech models.
Forget MOS: a new preference-based metric, AnimeScore, finally cracks the code for automatically evaluating "anime-like" speech with 90.8% AUC.
Evaluating 'anime-like' voices currently relies on costly subjective judgments, yet no standardized objective metric exists. A key challenge is that anime-likeness, unlike naturalness, lacks a shared absolute scale, making conventional Mean Opinion Score (MOS) protocols unreliable. To address this gap, we propose AnimeScore, a preference-based framework for automatic anime-likeness evaluation via pairwise ranking. We collect 15,000 pairwise judgments, with free-form descriptions, from 187 evaluators; acoustic analysis reveals that perceived anime-likeness is driven by controlled resonance shaping, prosodic continuity, and deliberate articulation rather than simple heuristics such as high pitch. We show that handcrafted acoustic features reach a 69.3% AUC ceiling, while SSL-based ranking models achieve up to 90.8% AUC, providing a practical metric that can also serve as a reward signal for preference-based optimization of generative speech models.
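The paper does not publish its training code, but the pairwise-ranking setup it describes can be sketched as follows. A minimal, hypothetical NumPy illustration: a scoring model assigns one scalar per clip, is trained with a Bradley-Terry-style pairwise loss (a common choice for preference data, assumed here rather than confirmed by the paper), and is evaluated by pairwise AUC, i.e. the fraction of judged pairs the model ranks consistently with the human preference.

```python
import numpy as np

def bt_loss(score_pref, score_other):
    """Bradley-Terry negative log-likelihood for pairwise preferences.

    The model should score the human-preferred clip higher:
    P(preferred wins) = sigmoid(score_pref - score_other).
    """
    margin = score_pref - score_other
    # -log sigmoid(margin) written stably as log(1 + exp(-margin))
    return np.mean(np.log1p(np.exp(-margin)))

def pairwise_auc(score_pref, score_other):
    """Fraction of pairs ranked correctly; ties count as half."""
    diff = score_pref - score_other
    return np.mean((diff > 0) + 0.5 * (diff == 0))

# Toy scores: hypothetical model outputs for 5 judged pairs
# (first array = preferred clip, second = the other clip).
pref = np.array([2.1, 0.3, 1.5, -0.2, 0.9])
other = np.array([1.0, 0.5, 0.4, -1.1, 0.9])
print(round(pairwise_auc(pref, other), 2))  # → 0.7
```

In the paper's actual system the scalar scores would come from a head on top of SSL speech representations; the same loss and AUC computation apply regardless of the feature extractor.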