Google Research | IISc | Feb 26, 2026 | arXiv:2602.23300

A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations

Soumya Dutta, Smruthi Balaji, Sriram Ganapathy

AI Summary

The paper introduces MiSTER-E, a Mixture-of-Experts (MoE) framework for Emotion Recognition in Conversations (ERC) that decouples modality-specific context modeling from multimodal fusion. It obtains utterance-level speech and text embeddings from fine-tuned LLMs, enhances them with a convolutional-recurrent context modeling layer, and combines predictions from speech-only, text-only, and cross-modal experts through a learned gating mechanism. The model is trained with a supervised contrastive loss and KL-divergence regularization, and it outperforms several baseline speech-text ERC systems on IEMOCAP, MELD, and MOSI without relying on speaker identity.

Key Contribution

MiSTER-E decouples modality-specific context modeling from multimodal fusion using a mixture-of-experts design, and it outperforms several baseline speech-text ERC systems on IEMOCAP, MELD, and MOSI without requiring speaker identity.

Abstract

Emotion Recognition in Conversations (ERC) presents unique challenges, requiring models to capture the temporal flow of multi-turn dialogues and to effectively integrate cues from multiple modalities. We propose Mixture of Speech-Text Experts for Recognition of Emotions (MiSTER-E), a modular Mixture-of-Experts (MoE) framework designed to decouple two core challenges in ERC: modality-specific context modeling and multimodal information fusion. MiSTER-E leverages large language models (LLMs) fine-tuned for both speech and text to provide rich utterance-level embeddings, which are then enhanced through a convolutional-recurrent context modeling layer. The system integrates predictions from three experts (speech-only, text-only, and cross-modal) using a learned gating mechanism that dynamically weighs their outputs. To further encourage consistency and alignment across modalities, we introduce a supervised contrastive loss between paired speech-text representations and a KL-divergence-based regularization across expert predictions. Importantly, MiSTER-E does not rely on speaker identity at any stage. Experiments on three benchmark datasets (IEMOCAP, MELD, and MOSI) show that our proposal achieves 70.9%, 69.5%, and 87.9% weighted F1-scores respectively, outperforming several baseline speech-text ERC systems. We also provide various ablations to highlight the contributions made in the proposed approach.
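
The gating and KL-regularization ideas in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact formulation, so the code assumes a softmax gate over three experts, a weighted average of their class probabilities, and a mean pairwise KL divergence as the consistency term. The function names (`gated_mixture`, `kl_consistency`) and all numeric values are hypothetical, and the gate scores are hard-coded where the paper's gate is learned. The supervised contrastive loss is omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def gated_mixture(expert_logits, gate_scores):
    """Combine per-expert class logits for one utterance.

    expert_logits: one list of class logits per expert
                   (assumed order: speech-only, text-only, cross-modal).
    gate_scores:   one unnormalized score per expert (assumed to come
                   from a learned gating network; hard-coded in this sketch).
    """
    expert_probs = [softmax(logits) for logits in expert_logits]
    gate = softmax(gate_scores)  # gate weights sum to 1
    n_classes = len(expert_probs[0])
    mixed = [
        sum(g * probs[c] for g, probs in zip(gate, expert_probs))
        for c in range(n_classes)
    ]
    return mixed, gate, expert_probs

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))

def kl_consistency(expert_probs):
    """Mean KL over all ordered expert pairs (an assumed form of the regularizer)."""
    n = len(expert_probs)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    return sum(kl(expert_probs[i], expert_probs[j]) for i, j in pairs) / len(pairs)

# Hypothetical logits for one utterance over four emotion classes.
expert_logits = [
    [2.0, 0.5, -1.0, 0.0],   # speech-only expert
    [1.5, 1.0, -0.5, 0.2],   # text-only expert
    [2.2, 0.3, -0.8, 0.1],   # cross-modal expert
]
gate_scores = [0.2, 0.1, 0.7]  # hypothetical output of a gating network

mixed, gate, expert_probs = gated_mixture(expert_logits, gate_scores)
regularizer = kl_consistency(expert_probs)
print("mixture probabilities:", mixed)
print("gate weights:", gate)
print("KL consistency term:", regularizer)
```

Because the gate weights and each expert's probabilities both sum to one, the mixture is itself a valid probability distribution. The consistency term is zero when all experts agree and grows as their predictions diverge.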

Architecture Design (Transformers, SSMs, MoE) | Multimodal Models | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 58

Year: 2026

Venue: N/A
