The paper introduces MERRY, a semantically decoupled evaluation framework for Multimodal Role-Playing Agents (MRPAs) that assesses emotional and role consistency by disentangling semantic assessment from modality generation. MERRY refines evaluation through five emotional consistency (EC) and three role consistency (RC) metrics, and recasts subjective scoring as a bidirectional-evidence-finding task to improve the agreement of LLM-as-Judge evaluations with human judgments. Empirical evaluations with MERRY show that training on synthetic data tends to reduce emotional consistency while real-world data improves it, that models exhibit emotional templatization with a positive bias, and that simple prompting helps weak models but constrains strong ones, while simple fine-tuning generalizes poorly across roles.
Current multimodal role-playing agents often fail at nuanced emotional expression, exhibiting biases towards positive emotions and struggling with role generalization, as revealed by a new decoupled evaluation framework.
Multimodal Role-Playing Agents (MRPAs) are attracting increasing attention due to their ability to deliver more immersive multimodal emotional interactions. However, existing studies still rely on purely textual benchmarks to evaluate the text responses of MRPAs, while delegating the assessment of their multimodal expressions solely to modality-synthesis metrics. This evaluation paradigm entangles semantic assessment with modality generation, leading to ambiguous error attribution, and it remains constrained by a heavy reliance on human judgment. To this end, we propose MERRY, a semantically decoupled evaluation framework for assessing the Multimodal Emotional and Role consistencies of Role-playing agents. The framework introduces five refined metrics for emotional consistency (EC) and three for role consistency (RC). Notably, we transform the traditional subjective scoring approach into a novel bidirectional-evidence-finding task, significantly improving the agreement of LLM-as-Judge evaluations with human judgments. Based on MERRY, we conduct extensive evaluations. Our empirical results primarily reveal that: (1) training on synthetic datasets tends to reduce emotional consistency, whereas training on real-world datasets improves it; (2) existing models suffer from emotional templatization and simplification, exhibiting a positive bias and a performance bottleneck on fine-grained negative emotions; (3) a simple prompting method strengthens weak models but constrains strong ones, while a simple fine-tuning method suffers from poor role generalization. Code and dataset are available.