This paper addresses the challenge of evaluating multi-turn dialogue by introducing an information-theoretic metric for semantic progress, defined as the accumulation of new, question-relevant, and non-redundant information over the course of a conversation. The authors formalize the metric with a Gaussian formulation that admits closed-form updates and yields monotonicity, additive decomposition of information gain across turns, and diminishing returns for redundant evidence. On MT-Bench, Chatbot Arena, and UltraFeedback, the metric achieves competitive agreement with human judgments, with improved alignment over several LLM-based judges on MT-Bench and UltraFeedback, and it remains effective with lightweight embedding models under CPU-only execution.
Semantic progress in dialogue can be quantified without relying on large models: a metric built on lightweight embedding models achieves competitive agreement with human judgments.
Evaluating multi-turn dialogue is challenging because quality emerges across turns rather than within individual responses. We focus on a key dimension of information-seeking dialogue: semantic progress, defined as the accumulation of new, question-relevant, and non-redundant information over the course of a conversation. We formalize semantic progress as question-conditioned uncertainty reduction and introduce an information-theoretic metric that approximates it in embedding space. Our main estimator uses a tractable Gaussian formulation with closed-form updates, while a complementary maximum-entropy argument shows why log-determinant structure arises more broadly when only second-order embedding information is retained. This formulation yields desirable theoretical properties, including monotonicity, additive decomposition of total information gain across turns, and diminishing returns for redundant evidence. Unlike LLM-as-a-judge approaches, our metric requires no autoregressive inference at evaluation time and is fully reproducible for a fixed embedding model. Experiments on MT-Bench, Chatbot Arena, and UltraFeedback show that the proposed metric achieves competitive agreement with human judgments despite targeting only semantic progress, with improved alignment on MT-Bench and UltraFeedback compared to several LLM-based judges. Notably, the method remains effective with lightweight embedding models under CPU-only execution, indicating that semantic progress can be captured without reliance on large model capacity.
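To make the three stated properties concrete, here is a minimal sketch of a log-determinant information gain under rank-one Gaussian updates. It is not the paper's estimator. The function name `turn_gains`, the isotropic prior, the `noise_var` and `prior_var` parameters, and the clipped-cosine relevance weight are all assumptions chosen for illustration, and the toy vectors stand in for real embeddings.

```python
import math


def turn_gains(question, turns, noise_var=0.1, prior_var=1.0):
    """Per-turn information gain in nats under rank-one Gaussian updates.

    Assumptions for illustration only (not the paper's estimator): an
    isotropic prior, one noisy scalar observation per turn along the turn's
    unit embedding, and a clipped cosine similarity to the question as the
    relevance weight.
    """
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    q = unit(question)
    d = len(q)
    # Posterior covariance, initialised to the isotropic prior.
    cov = [[prior_var if i == j else 0.0 for j in range(d)] for i in range(d)]
    gains = []
    for turn in turns:
        u = unit(turn)
        w = max(0.0, sum(a * b for a, b in zip(q, u)))  # relevance in [0, 1]
        cu = [sum(cov[i][j] * u[j] for j in range(d)) for i in range(d)]
        s = w * sum(a * b for a, b in zip(u, cu)) / noise_var
        # Matrix determinant lemma: the log-det gain of a rank-one update.
        gains.append(0.5 * math.log1p(s))
        # Sherman-Morrison: closed-form covariance after the update.
        scale = (w / noise_var) / (1.0 + s)
        for i in range(d):
            for j in range(d):
                cov[i][j] -= scale * cu[i] * cu[j]
    return gains


# Toy 3-d "embeddings": a relevant turn, a redundant repeat, an off-topic turn.
question = [1.0, 0.0, 0.0]
turns = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
gains = turn_gains(question, turns)
```

In this toy setup every gain is non-negative (monotonicity), the repeated second turn earns less than the first (diminishing returns), and the off-topic third turn earns nothing. The total gain is `sum(gains)`, which telescopes to half the log-determinant ratio between the prior and the final posterior covariance (additive decomposition).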