This paper introduces a multi-agent framework for multimodal empathetic response generation (MERG) that adds structured reasoning and reflective refinement to address the limitations of existing one-pass generation paradigms. The framework decomposes response generation into multimodal perception, emotion forecasting, strategy planning, and guided response generation, followed by a global reflection agent that audits the intermediate states and the generated response for emotional biases. Experiments on the IEMOCAP and MELD datasets show stronger empathetic response generation than state-of-the-art methods.
Achieve more human-like empathy in multimodal response generation by explicitly modeling the hierarchical progression of emotion perception and iteratively refining responses to eliminate emotional biases.
Multimodal empathetic response generation (MERG) aims to generate emotionally engaging and empathetic responses based on users' multimodal contexts. Existing approaches usually rely on an implicit one-pass generation paradigm that maps the multimodal context directly to the final response, which overlooks two intrinsic characteristics of MERG: (1) Human perception of emotional cues is inherently structured rather than a direct mapping. The conventional paradigm neglects the hierarchical progression of emotion perception, leading to distorted emotional judgments. (2) Given the inherent complexity and ambiguity of human emotions, the conventional paradigm is prone to significant emotional biases, ultimately resulting in suboptimal empathy. In this paper, we propose a multi-agent framework for MERG that enhances empathy through structured reasoning and reflective refinement. Specifically, we first introduce a structured empathetic reasoning-to-generation module that explicitly decomposes response generation into multimodal perception, consistency-aware emotion forecasting, pragmatic strategy planning, and strategy-guided response generation, providing a clearer intermediate path from multimodal evidence to response realization. In addition, we develop a global reflection and refinement module, in which a global reflection agent performs step-wise auditing over the intermediate states and the generated response, eliminating emotional biases and empathy errors and triggering targeted regeneration. Overall, this closed-loop framework enables our model to gradually improve the accuracy of emotion perception and eliminate emotional biases over successive iterations. Experiments on several benchmarks, e.g., IEMOCAP and MELD, demonstrate that our model generates more empathetic responses than state-of-the-art methods.
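The closed-loop control flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: every name here (`State`, `run_stages`, `closed_loop`, the toy agents, the `max_rounds` cap) is hypothetical, the agents are rule-based stand-ins for what would presumably be model-backed agents, and the sketch assumes that "targeted regeneration" means rerunning the flagged stage together with all stages downstream of it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple

# The four stages named in the abstract, in order (identifiers are hypothetical).
STAGES = ["perception", "emotion", "strategy", "response"]


@dataclass
class State:
    context: Dict[str, str]                       # multimodal input, one entry per modality
    outputs: Dict[str, str] = field(default_factory=dict)  # intermediate state per stage


Agent = Callable[[State, Optional[str]], str]
Reflector = Callable[[State], Optional[Tuple[str, str]]]


def run_stages(state: State, agents: Dict[str, Agent],
               start: int = 0, feedback: Optional[str] = None) -> None:
    """Run stages from `start` onward; only the first rerun stage receives the feedback."""
    for offset, name in enumerate(STAGES[start:]):
        state.outputs[name] = agents[name](state, feedback if offset == 0 else None)


def closed_loop(context: Dict[str, str], agents: Dict[str, Agent],
                reflect: Reflector, max_rounds: int = 3) -> Tuple[State, int]:
    """One forward pass, then audit and regenerate until the audit passes or the cap is hit."""
    state = State(context)
    run_stages(state, agents)
    rounds = 0
    for _ in range(max_rounds):
        verdict = reflect(state)                  # None means the audit found nothing
        if verdict is None:
            break
        stage, feedback = verdict
        # Assumption: regeneration restarts at the flagged stage and reruns downstream stages.
        run_stages(state, agents, start=STAGES.index(stage), feedback=feedback)
        rounds += 1
    return state, rounds


# Toy rule-based stand-ins for the agents (purely illustrative).
def perceive(state: State, feedback: Optional[str]) -> str:
    c = state.context
    return f"text={c['text']!r}; audio={c['audio']}; visual={c['visual']}"


def forecast_emotion(state: State, feedback: Optional[str]) -> str:
    # Naive first guess; adopts the reflection agent's correction when one is given.
    return feedback if feedback else "neutral"


def plan_strategy(state: State, feedback: Optional[str]) -> str:
    return "comfort" if state.outputs["emotion"] == "sad" else "acknowledge"


def generate(state: State, feedback: Optional[str]) -> str:
    return f"[{state.outputs['strategy']}] reply for emotion={state.outputs['emotion']}"


def reflect(state: State) -> Optional[Tuple[str, str]]:
    # Toy audit: flag the emotion stage when the audio cue contradicts the forecast.
    if "trembling" in state.context["audio"] and state.outputs["emotion"] != "sad":
        return ("emotion", "sad")
    return None


AGENTS = {"perception": perceive, "emotion": forecast_emotion,
          "strategy": plan_strategy, "response": generate}

final, rounds = closed_loop(
    {"text": "I'm fine.", "audio": "trembling voice", "visual": "looking down"},
    AGENTS, reflect)
print(rounds, final.outputs["response"])
```

In this toy run the first pass reads the text alone as neutral, the audit flags the emotion stage because the audio cue disagrees, and the emotion, strategy, and response stages are regenerated while the perception output is left as it was. The `max_rounds` cap is an assumption added so the loop always terminates; the abstract does not state a stopping rule.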