The paper introduces VoxEmo, a new benchmark for speech emotion recognition (SER) designed specifically for evaluating speech LLMs through their generative interfaces. VoxEmo encompasses 35 emotion corpora across 15 languages and provides a standardized toolkit with prompts of varying complexity and a distribution-aware soft-label protocol that accounts for the inherent ambiguity of human emotion. Experiments using VoxEmo show that while zero-shot speech LLMs underperform supervised baselines in hard-label accuracy, they align more closely with the subjective label distributions produced by human annotators.
Speech LLMs, though lagging in accuracy, capture the nuances of human emotion perception better than traditional supervised methods, a finding revealed by the new VoxEmo benchmark.
Speech Large Language Models (LLMs) show great promise for speech emotion recognition (SER) via generative interfaces. However, shifting from closed-set classification to open-ended text generation introduces zero-shot stochasticity, making evaluation highly sensitive to prompts. Additionally, conventional speech LLM benchmarks overlook the inherent ambiguity of human emotion. Hence, we present VoxEmo, a comprehensive SER benchmark for speech LLMs encompassing 35 emotion corpora across 15 languages. VoxEmo provides a standardized toolkit featuring prompts of varying complexity, from direct classification to paralinguistic reasoning. To reflect real-world perception, we introduce a distribution-aware soft-label protocol and a prompt-ensemble strategy that emulates annotator disagreement. Experiments reveal that while zero-shot speech LLMs trail supervised baselines in hard-label accuracy, they uniquely align with human subjective distributions.
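To make the distribution-aware idea concrete, here is a minimal illustrative sketch (not the paper's released toolkit) of how a prompt ensemble could yield a soft label that is then compared against a hypothetical annotator vote distribution. The emotion set, the majority-vote aggregation, and the use of Jensen-Shannon divergence as the alignment score are all assumptions for illustration.

```python
# Illustrative sketch, not VoxEmo's actual implementation: aggregate predictions
# from several prompt variants into a soft label and compare it against a
# (hypothetical) human annotator distribution.
from collections import Counter

import numpy as np
from scipy.spatial.distance import jensenshannon

# Assumed label set; the benchmark's corpora may use different taxonomies.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def ensemble_soft_label(predictions: list[str]) -> np.ndarray:
    """Turn hard predictions from several prompt variants into a soft label."""
    counts = Counter(p for p in predictions if p in EMOTIONS)
    total = sum(counts.values()) or 1
    return np.array([counts[e] / total for e in EMOTIONS])


def distribution_alignment(model_dist: np.ndarray, human_dist: np.ndarray) -> float:
    """Jensen-Shannon divergence between distributions (0 = identical)."""
    return float(jensenshannon(model_dist, human_dist, base=2) ** 2)


# Example: four prompt variants disagree on one utterance, much as annotators do.
model_preds = ["happy", "happy", "neutral", "happy"]
human_votes = np.array([0.0, 0.6, 0.3, 0.1])  # hypothetical annotator vote shares

soft = ensemble_soft_label(model_preds)        # e.g. [0.  0.75 0.25 0. ]
print(distribution_alignment(soft, human_votes))
```

Under this framing, a model can score poorly on hard-label accuracy for ambiguous utterances while still producing a distribution close to the annotators', which is the contrast the benchmark's experiments highlight.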