Mar 9, 2026 | arXiv:2603.08230

Disentangling Reasoning in Large Audio-Language Models for Ambiguous Emotion Prediction

Xiao Yu, Xiaofeng Yu, Jiaheng Dong, Jean Honorio, Abhirup Ghosh, Ting Dang

AI Summary

This paper addresses the limited ability of Large Audio-Language Models (LALMs) to handle the ambiguity of human emotion by reformulating emotion recognition as a distributional reasoning problem. The authors introduce an ambiguity-aware objective that aligns model predictions with human perceptual distributions, and structured chain-of-thought supervision that guides reasoning over emotional cues. Experiments on the IEMOCAP and CREMA-D datasets show consistent improvements across the SFT, DPO, and GRPO training strategies.

Key Contribution

An ambiguity-aware training framework that moves LALMs beyond single-label emotion prediction by aligning model outputs with human perceptual distributions and by supervising a structured chain of thought over emotional cues.

Abstract

Speech emotion recognition plays an important role in various applications. However, most existing approaches predict a single emotion label, oversimplifying the inherently ambiguous nature of human emotional expression. Recent large audio-language models show promise in generating richer outputs, but their reasoning ability for ambiguous emotional understanding remains limited. In this work, we reformulate ambiguous emotion recognition as a distributional reasoning problem and present the first systematic study of ambiguity-aware reasoning in LALMs. Our framework comprises two complementary components: an ambiguity-aware objective that aligns predictions with human perceptual distributions, and a structured ambiguity-aware chain-of-thought supervision that guides reasoning over emotional cues. Experiments on IEMOCAP and CREMA-D demonstrate consistent improvements across SFT, DPO, and GRPO training strategies.
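To make "aligning predictions with human perceptual distributions" concrete, here is a minimal sketch in Python. It is not the paper's objective, which the abstract does not specify: the label set, the vote-counting soft label, and the use of KL divergence as the alignment measure are all assumptions chosen for illustration, and soft_label and kl_divergence are hypothetical helper names.

```python
import math
from collections import Counter

# Hypothetical label set; the datasets' actual categories may differ.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def soft_label(votes, labels=EMOTIONS):
    """Turn per-annotator votes into a probability distribution over labels."""
    counts = Counter(votes)
    total = sum(counts[label] for label in labels)
    if total == 0:
        raise ValueError("no votes for any known label")
    return [counts[label] / total for label in labels]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted). Terms where the target is 0 contribute 0.

    Assumption: KL divergence stands in for the ambiguity-aware objective.
    """
    return sum(
        t * math.log(t / max(p, eps))
        for t, p in zip(target, predicted)
        if t > 0
    )


# Three annotators disagree: two hear "sad", one hears "neutral".
target = soft_label(["sad", "sad", "neutral"])

# A single-label prediction versus one that spreads probability mass.
one_hot = [0.0, 0.0, 0.0, 1.0]
spread = [0.05, 0.05, 0.30, 0.60]

print("single-label loss:", kl_divergence(target, one_hot))
print("distributional loss:", kl_divergence(target, spread))
```

Under these assumptions, the single-label prediction is penalized heavily because it assigns no probability to "neutral", which one annotator chose, while the spread prediction stays close to the annotator distribution.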

Multimodal Models | Reasoning & Chain-of-Thought | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 22

Year: 2026

Venue: N/A
