The paper investigates the validity of using emotion embedding similarity as an objective metric for evaluating emotional expressiveness in speech generation. Through adversarial tasks and human alignment tests, the authors demonstrate that emotion embeddings from encoders such as emotion2vec are significantly influenced by linguistic and speaker variation, which overshadows emotional features. As a result, the metric misaligns with human perception and rewards acoustic mimicry rather than genuine emotional synthesis, calling into question the reliability of this widely used evaluation method.
Widely used emotion embedding similarity metrics for speech generation are more sensitive to speaker and linguistic features than actual emotion, rendering them unreliable for evaluating emotional expressiveness.
Objective metrics for emotional expressiveness are vital for speech generation, particularly in expressive synthesis and voice conversion tasks that require emotional prosody transfer. To quantify expressiveness, the field widely relies on emotion similarity between reference and generated samples, computed as the cosine similarity of embeddings from encoders such as emotion2vec, on the assumption that these embeddings capture affective cues robustly across linguistic and speaker variation. We challenge this assumption through controlled adversarial tasks and human alignment tests. Despite high emotion classification accuracy, these latent spaces prove unsuitable for zero-shot similarity evaluation: representational limitations allow linguistic and speaker interference to overshadow emotional features, degrading discriminative ability. Consequently, the metric misaligns with human perception. This vulnerability to surface acoustics means the metric rewards acoustic mimicry over genuine emotional synthesis.
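The metric under scrutiny is simply the cosine similarity between two emotion embeddings. The sketch below shows that computation with placeholder NumPy vectors standing in for encoder outputs; the actual emotion2vec feature-extraction API is not shown and the 768-dimensional size is an assumption for illustration only.

```python
import numpy as np

def emotion_similarity(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """Cosine similarity between two emotion embeddings.

    ref_emb and gen_emb are placeholders for vectors produced by an
    emotion encoder such as emotion2vec (extraction API not shown).
    """
    num = float(np.dot(ref_emb, gen_emb))
    denom = float(np.linalg.norm(ref_emb) * np.linalg.norm(gen_emb))
    return num / denom

# Toy demonstration with synthetic "embeddings"; in a real evaluation
# these would be extracted from reference and generated speech.
rng = np.random.default_rng(0)
ref = rng.standard_normal(768)          # hypothetical embedding size
gen = ref + 0.1 * rng.standard_normal(768)  # mildly perturbed copy
print(emotion_similarity(ref, gen))     # close to 1.0 for similar vectors
```

The paper's point is that a high score here need not reflect matched emotion: because the embedding space also encodes speaker identity and linguistic content, acoustically similar but emotionally mismatched samples can still score well.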