The paper introduces MiSTER-E, a Mixture-of-Experts (MoE) framework for Emotion Recognition in Conversations (ERC) that decouples modality-specific context modeling from multimodal fusion. It uses fine-tuned LLMs for speech and text embeddings, enhances them with convolutional-recurrent context modeling, and integrates predictions from speech-only, text-only, and cross-modal experts via a learned gating mechanism. The model is trained with a supervised contrastive loss and KL-divergence regularization, and reaches weighted F1-scores of 70.9%, 69.5%, and 87.9% on IEMOCAP, MELD, and MOSI respectively, outperforming several baseline speech-text ERC systems without relying on speaker identity.
State-of-the-art emotion recognition in conversations is now possible by decoupling modality-specific context modeling and multimodal fusion with a mixture-of-experts approach that doesn't require speaker identity.
Emotion Recognition in Conversations (ERC) presents unique challenges, requiring models to capture the temporal flow of multi-turn dialogues and to effectively integrate cues from multiple modalities. We propose Mixture of Speech-Text Experts for Recognition of Emotions (MiSTER-E), a modular Mixture-of-Experts (MoE) framework designed to decouple two core challenges in ERC: modality-specific context modeling and multimodal information fusion. MiSTER-E leverages large language models (LLMs) fine-tuned for both speech and text to provide rich utterance-level embeddings, which are then enhanced through a convolutional-recurrent context modeling layer. The system integrates predictions from three experts (speech-only, text-only, and cross-modal) using a learned gating mechanism that dynamically weighs their outputs. To further encourage consistency and alignment across modalities, we introduce a supervised contrastive loss between paired speech-text representations and a KL-divergence-based regularization across expert predictions. Importantly, MiSTER-E does not rely on speaker identity at any stage. Experiments on three benchmark datasets (IEMOCAP, MELD, and MOSI) show that our proposal achieves 70.9%, 69.5%, and 87.9% weighted F1-scores respectively, outperforming several baseline speech-text ERC systems. We also provide various ablations to highlight the contributions made in the proposed approach.
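The gating and regularization ideas described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`gated_fusion`, `kl_consistency`), the tensor shapes, the use of random logits in place of real expert outputs, and the specific form of the KL term (fused prediction against each expert) are all assumptions made for illustration; the abstract states only that a learned gate weighs three experts and that a KL-divergence-based regularizer is applied across expert predictions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_fusion(expert_logits, gate_logits):
    """Mix per-expert class distributions with per-utterance gate weights.

    expert_logits: (n_experts, batch, n_classes), hypothetical expert outputs
    gate_logits:   (batch, n_experts), hypothetical gating-network outputs
    """
    expert_probs = softmax(expert_logits, axis=-1)   # (E, B, C)
    gate = softmax(gate_logits, axis=-1)             # (B, E), rows sum to 1
    # Convex combination of expert distributions for each utterance.
    fused = np.einsum("be,ebc->bc", gate, expert_probs)
    return fused, expert_probs, gate

def kl_consistency(expert_probs, fused, eps=1e-12):
    """Mean KL(fused || expert) over experts and utterances.

    Assumed form of the regularizer; the source does not specify the
    direction or pairing of the KL term.
    """
    log_ratio = np.log(fused[None] + eps) - np.log(expert_probs + eps)
    kl = (fused[None] * log_ratio).sum(axis=-1)      # (E, B)
    return kl.mean()

# Toy example: 3 experts (speech-only, text-only, cross-modal),
# 4 utterances, 6 emotion classes. Values are random placeholders.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4, 6))
gate_logits = rng.normal(size=(4, 3))
fused, probs, gate = gated_fusion(logits, gate_logits)
reg = kl_consistency(probs, fused)
```

Because the gate weights form a convex combination, each fused row remains a valid probability distribution, and the regularizer shrinks toward zero as the three experts agree.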