Mar 5, 2026

arXiv:2603.04969

MPCEval: A Benchmark for Multi-Party Conversation Generation

Minxing Zhang, Yi Yang, Zhuofan Jia, Xuan Yang, Jian Pei, Yu Zang, Yuchen Zang, Xingwang Deng, Xianglong Chen

AI Summary

MPCEval is introduced as a benchmark to evaluate multi-party conversation generation, addressing the limitations of existing metrics in capturing the complexities of such conversations. It decomposes generation quality into speaker modeling, content quality, and speaker-content consistency, distinguishing between local next-turn prediction and global full-conversation generation. Experiments using MPCEval on diverse datasets reveal dimension-specific model characteristics, highlighting the importance of considering multiple evaluation objectives for a comprehensive assessment of multi-party conversational behavior.

Key Contribution

Single-score evaluations hide critical differences in multi-party conversational AI, so MPCEval breaks down generation quality into speaker modeling, content, and consistency to reveal nuanced model behaviors.

Abstract

Multi-party conversation generation, such as smart reply and collaborative assistants, is an increasingly important capability of generative AI, yet its evaluation remains a critical bottleneck. Compared to two-party dialogue, multi-party settings introduce distinct challenges, including complex turn-taking, role-dependent speaker behavior, long-range conversational structure, and multiple equally valid continuations. Accordingly, we introduce MPCEval, a task-aware evaluation and benchmarking suite for multi-party conversation generation. MPCEval decomposes generation quality into speaker modeling, content quality, and speaker-content consistency, and explicitly distinguishes local next-turn prediction from global full-conversation generation. It provides novel, quantitative, reference-free, and reproducible metrics that scale across datasets and models. We apply MPCEval to diverse public and real-world datasets and evaluate modern generation methods alongside human-authored conversations. The results reveal systematic, dimension-specific model characteristics in participation balance, content progression and novelty, and speaker-content consistency, demonstrating that evaluation objectives critically shape model assessment and that single-score evaluation obscures fundamental differences in multi-party conversational behavior. The implementation of MPCEval and the associated evaluation code are publicly available at https://github.com/Owen-Yang-18/MPCEval.
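To make the idea of a reference-free, speaker-level measure concrete, here is a minimal sketch of one way "participation balance" could be quantified: the normalized Shannon entropy of per-speaker turn counts. This is an illustrative assumption, not MPCEval's actual metric; the function name `participation_balance` and the entropy formulation are hypothetical and do not come from the paper or its repository.

```python
import math
from collections import Counter


def participation_balance(speakers):
    """Normalized Shannon entropy of speaker turn counts, in [0, 1].

    Hypothetical illustration only (not MPCEval's definition).
    `speakers` is the sequence of speaker labels, one per turn.
    Returns 1.0 when all speakers take equally many turns and
    0.0 when fewer than two distinct speakers appear.
    """
    counts = Counter(speakers)
    if len(counts) < 2:
        return 0.0  # entropy is zero (or undefined) with < 2 speakers
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    # Divide by the maximum possible entropy for this number of speakers.
    return entropy / math.log(len(counts))


balanced = participation_balance(["A", "B", "C", "A", "B", "C"])
skewed = participation_balance(["A", "A", "A", "A", "B", "C"])
```

Under this sketch the score needs no reference conversation, only the generated speaker sequence, and a conversation dominated by one speaker scores lower than one with evenly shared turns.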

Eval Frameworks & Benchmarks, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 49

Year: 2026

Venue: N/A
