The paper introduces VoxEmo, a new benchmark for speech emotion recognition (SER) designed specifically for evaluating speech LLMs through their generative interfaces. VoxEmo encompasses 35 emotion corpora across 15 languages and provides a standardized toolkit with prompts of varying complexity and a distribution-aware soft-label protocol that accounts for the inherent ambiguity of human emotion. Experiments using VoxEmo show that while zero-shot speech LLMs underperform supervised baselines in hard-label accuracy, they align more closely with the subjective label distributions produced by human annotators.
Speech LLMs, though lagging in accuracy, capture the nuances of human emotion perception better than traditional supervised methods, a finding revealed by the new VoxEmo benchmark.
Speech Large Language Models (LLMs) show great promise for speech emotion recognition (SER) via generative interfaces. However, shifting from closed-set classification to open-ended text generation introduces zero-shot stochasticity, making evaluation highly sensitive to prompts. Additionally, conventional speech LLM benchmarks overlook the inherent ambiguity of human emotion. Hence, we present VoxEmo, a comprehensive SER benchmark for speech LLMs encompassing 35 emotion corpora across 15 languages. VoxEmo provides a standardized toolkit featuring prompts of varying complexity, from direct classification to paralinguistic reasoning. To reflect real-world perception, we introduce a distribution-aware soft-label protocol and a prompt-ensemble strategy that emulates annotator disagreement. Experiments reveal that while zero-shot speech LLMs trail supervised baselines in hard-label accuracy, they uniquely align with human subjective distributions.
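To make the distribution-aware idea concrete, here is a minimal illustrative sketch (not the paper's released toolkit) of how a prompt ensemble could yield a soft label that is then compared against a hypothetical annotator vote distribution. The emotion set, the majority-vote aggregation, and the use of Jensen-Shannon divergence as the alignment score are all assumptions for illustration.

```python
# Illustrative sketch, not VoxEmo's actual implementation: aggregate predictions
# from several prompt variants into a soft label and compare it against a
# (hypothetical) human annotator distribution.
from collections import Counter

import numpy as np
from scipy.spatial.distance import jensenshannon

# Assumed label set; the benchmark's corpora may use different taxonomies.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def ensemble_soft_label(predictions: list[str]) -> np.ndarray:
    """Turn hard predictions from several prompt variants into a soft label."""
    counts = Counter(p for p in predictions if p in EMOTIONS)
    total = sum(counts.values()) or 1
    return np.array([counts[e] / total for e in EMOTIONS])


def distribution_alignment(model_dist: np.ndarray, human_dist: np.ndarray) -> float:
    """Jensen-Shannon divergence between distributions (0 = identical)."""
    return float(jensenshannon(model_dist, human_dist, base=2) ** 2)


# Example: four prompt variants disagree on one utterance, much as annotators do.
model_preds = ["happy", "happy", "neutral", "happy"]
human_votes = np.array([0.0, 0.6, 0.3, 0.1])  # hypothetical annotator vote shares

soft = ensemble_soft_label(model_preds)        # e.g. [0.  0.75 0.25 0. ]
print(distribution_alignment(soft, human_votes))
```

Under this framing, a model can score poorly on hard-label accuracy for ambiguous utterances while still producing a distribution close to the annotators', which is the contrast the benchmark's experiments highlight.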