Mar 17, 2026 | arXiv:2603.16463

Follow the Clues, Frame the Truth: Hybrid-evidential Deductive Reasoning in Open-Vocabulary Multimodal Emotion Recognition

Lei Zhang, Haoxun Li, Hanlei Shi, Yuxuan Ding, Leyuan Qu, Taihao Li

AI Summary

The paper introduces HyDRA, a Hybrid-evidential Deductive Reasoning Architecture, to address the challenge of Open-Vocabulary Multimodal Emotion Recognition (OV-MER) where ambiguous multimodal cues hinder performance. HyDRA employs a Propose-Verify-Decide protocol to reconstruct nuanced emotional states by synthesizing evidence-grounded rationales from diverse latent perspectives. Reinforcement learning with hierarchical reward shaping is used to align reasoning trajectories with task performance, enabling the model to reconcile multimodal cues effectively.

Key Contribution

MLLMs often underperform at open-vocabulary multimodal emotion recognition because they commit prematurely to dominant data priors. HyDRA counters this with a Propose-Verify-Decide protocol, trained with reinforcement learning and hierarchical reward shaping, that synthesizes evidence-grounded rationales. It consistently outperforms strong baselines, especially in ambiguous or conflicting scenarios.

Abstract

Open-Vocabulary Multimodal Emotion Recognition (OV-MER) is inherently challenging due to the ambiguity of equivocal multimodal cues, which often stem from distinct unobserved situational dynamics. While Multimodal Large Language Models (MLLMs) offer extensive semantic coverage, their performance is often bottlenecked by premature commitment to dominant data priors, resulting in suboptimal heuristics that overlook crucial, complementary affective cues across modalities. We argue that effective affective reasoning requires more than surface-level association; it necessitates reconstructing nuanced emotional states by synthesizing multiple evidence-grounded rationales that reconcile these observations from diverse latent perspectives. We introduce HyDRA, a Hybrid-evidential Deductive Reasoning Architecture that formalizes inference as a Propose-Verify-Decide protocol. To internalize this abductive process, we employ reinforcement learning with hierarchical reward shaping, aligning the reasoning trajectories with final task performance to ensure they best reconcile the observed multimodal cues. Systematic evaluations validate our design choices, with HyDRA consistently outperforming strong baselines--especially in ambiguous or conflicting scenarios--while providing interpretable, diagnostic evidence traces.
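The abstract names the Propose-Verify-Decide protocol but does not spell out its steps, so the following is only a toy sketch of how such a loop might be organized. Everything in it is an assumption made for illustration: the function names, the representation of cues as per-modality label sets, and the `min_support` threshold are hypothetical and are not taken from the paper. HyDRA itself is an MLLM trained with reinforcement learning, not a rule-based counter.

```python
from typing import Dict, List, Set

# Hypothetical input format: each modality maps to the emotion labels its cues suggest.
Cues = Dict[str, Set[str]]


def propose(cues: Cues) -> List[str]:
    """Propose: treat every label suggested by any modality as a candidate hypothesis."""
    return sorted(set().union(*cues.values()))


def verify(candidate: str, cues: Cues) -> List[str]:
    """Verify: list the modalities whose cues support the candidate (its evidence trace)."""
    return sorted(m for m, labels in cues.items() if candidate in labels)


def decide(cues: Cues, min_support: int = 2) -> Dict[str, List[str]]:
    """Decide: keep candidates backed by at least `min_support` modalities.

    Returns an open-ended set of labels, each paired with its evidence trace.
    The threshold rule is an illustrative assumption, not the paper's criterion.
    """
    traces = {c: verify(c, cues) for c in propose(cues)}
    return {c: ev for c, ev in traces.items() if len(ev) >= min_support}


example_cues: Cues = {
    "audio": {"anxious", "tense"},
    "video": {"anxious", "smiling"},
    "text": {"polite", "tense"},
}
result = decide(example_cues)
print(result)
```

In this toy input, "smiling" and "polite" each appear in only one modality and are dropped, while "anxious" and "tense" survive with the modalities that support them, which loosely mirrors the idea of reconciling conflicting cues while keeping a diagnostic evidence trace.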

Multimodal Models, Natural Language Processing, Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
