This paper introduces HumDial-EIBench, a new benchmark for evaluating emotional intelligence in Audio Language Models (ALMs) using human-recorded dialogues from the ICASSP 2026 HumDial Challenge. The benchmark reformulates emotional tracking and causal reasoning into multiple-choice questions with adversarial distractors and includes an acoustic-semantic conflict task. Experiments on eight ALMs reveal deficiencies in multi-turn emotional tracking, causal reasoning, and robustness against acoustic-semantic conflicts, highlighting a text-dominance bias.
ALMs may handle the words, but HumDial-EIBench shows that most struggle to track emotions and their causes across real multi-turn human conversations.
Evaluating the emotional intelligence (EI) of audio language models (ALMs) is critical. However, existing benchmarks mostly rely on synthesized speech, are limited to single-turn interactions, and depend heavily on open-ended scoring. This paper proposes HumDial-EIBench, a comprehensive benchmark for evaluating the EI of ALMs. Using human-recorded dialogues from the ICASSP 2026 HumDial Challenge, it reformulates emotional tracking and causal reasoning as multiple-choice questions with adversarial distractors, mitigating subjective scoring bias for cognitive tasks. It retains the generation of empathetic responses and introduces an acoustic-semantic conflict task to assess robustness against contradictory multimodal signals. Evaluations of eight ALMs reveal that most models struggle with multi-turn emotional tracking and implicit causal reasoning. Furthermore, all models exhibit decoupled textual and acoustic empathy, alongside a severe text-dominance bias during cross-modal conflicts.
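
To make the multiple-choice framing concrete, the following is a minimal sketch of how such items might be scored. It is not the benchmark's actual code or data format: the `ChoiceItem` schema, the `text_implied` field, and the `text_dominance_rate` metric are hypothetical names introduced here for illustration, assuming each conflict item records which option the transcript alone would suggest.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ChoiceItem:
    """One multiple-choice item (hypothetical schema, not the benchmark's format)."""
    gold: str        # label of the correct option, e.g. "B"
    predicted: str   # label of the option the model chose
    # Set only for acoustic-semantic conflict items (assumption): the option
    # that the spoken words alone would suggest, ignoring tone of voice.
    text_implied: Optional[str] = None


def accuracy(items: List[ChoiceItem]) -> float:
    """Fraction of items where the model picked the gold option."""
    if not items:
        return 0.0
    return sum(it.predicted == it.gold for it in items) / len(items)


def text_dominance_rate(items: List[ChoiceItem]) -> float:
    """Among conflict items, fraction where the model followed the text cue."""
    conflict = [it for it in items if it.text_implied is not None]
    if not conflict:
        return 0.0
    return sum(it.predicted == it.text_implied for it in conflict) / len(conflict)


# Toy data, invented purely for illustration.
items = [
    ChoiceItem(gold="A", predicted="A"),
    ChoiceItem(gold="C", predicted="B"),
    ChoiceItem(gold="B", predicted="D", text_implied="D"),
    ChoiceItem(gold="A", predicted="A", text_implied="C"),
]
print(accuracy(items))             # 0.5
print(text_dominance_rate(items))  # 0.5
```

Exact-match scoring against a fixed option set is what removes the need for a subjective judge on the cognitive tasks; the empathetic-response generation task would still need a separate, open-ended evaluation.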