The paper investigates the validity of using emotion embedding similarity as an objective metric for evaluating emotional expressiveness in speech generation. Through adversarial tasks and human alignment tests, the authors demonstrate that emotion embeddings from encoders such as emotion2vec are significantly influenced by linguistic and speaker variation, which overshadows emotional features. As a result, the metric misaligns with human perception and rewards acoustic mimicry rather than genuine emotional synthesis, calling into question the reliability of this widely used evaluation method.
Widely used emotion embedding similarity metrics for speech generation are more sensitive to speaker and linguistic features than actual emotion, rendering them unreliable for evaluating emotional expressiveness.
Objective metrics for emotional expressiveness are vital for speech generation, particularly in expressive synthesis and voice conversion tasks that require emotional prosody transfer. To quantify expressiveness, the field widely relies on emotion similarity between reference and generated samples, computed as the cosine similarity of embeddings from encoders such as emotion2vec, on the assumption that these embeddings capture affective cues robustly across linguistic and speaker variation. We challenge this assumption through controlled adversarial tasks and human alignment tests. Despite high emotion classification accuracy, these latent spaces prove unsuitable for zero-shot similarity evaluation: representational limitations allow linguistic and speaker interference to overshadow emotional features, degrading discriminative ability. Consequently, the metric misaligns with human perception. This vulnerability to surface acoustics means the metric rewards acoustic mimicry over genuine emotional synthesis.
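The metric under scrutiny is simply the cosine similarity between two emotion embeddings. The sketch below shows that computation with placeholder NumPy vectors standing in for encoder outputs; the actual emotion2vec feature-extraction API is not shown and the 768-dimensional size is an assumption for illustration only.

```python
import numpy as np

def emotion_similarity(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """Cosine similarity between two emotion embeddings.

    ref_emb and gen_emb are placeholders for vectors produced by an
    emotion encoder such as emotion2vec (extraction API not shown).
    """
    num = float(np.dot(ref_emb, gen_emb))
    denom = float(np.linalg.norm(ref_emb) * np.linalg.norm(gen_emb))
    return num / denom

# Toy demonstration with synthetic "embeddings"; in a real evaluation
# these would be extracted from reference and generated speech.
rng = np.random.default_rng(0)
ref = rng.standard_normal(768)          # hypothetical embedding size
gen = ref + 0.1 * rng.standard_normal(768)  # mildly perturbed copy
print(emotion_similarity(ref, gen))     # close to 1.0 for similar vectors
```

The paper's point is that a high score here need not reflect matched emotion: because the embedding space also encodes speaker identity and linguistic content, acoustically similar but emotionally mismatched samples can still score well.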