This paper investigates the reliability of user-centric evaluation of Conversational Recommender Systems (CRS) using static dialogue transcripts annotated by crowd workers. The authors collected 1,053 annotations from 124 crowd workers on 200 ReDial dialogues, employing the 18-dimensional CRS-Que framework. Results indicate that utilitarian dimensions exhibit moderate reliability upon aggregation, while social dimensions are less reliable, and a strong halo effect collapses many dimensions into a single quality signal.
Evaluating conversational AI with crowd workers? Turns out, you're mostly just measuring a general "good" vibe, not specific qualities like humanness or rapport.
User-centric evaluation has become a key paradigm for assessing Conversational Recommender Systems (CRS), aiming to capture subjective qualities such as satisfaction, trust, and rapport. To enable scalable evaluation, recent work increasingly relies on third-party annotations of static dialogue logs by crowd workers or large language models. However, the reliability of this practice remains largely unexamined. In this paper, we present a large-scale empirical study investigating the reliability and structure of user-centric CRS evaluation on static dialogue transcripts. We collected 1,053 annotations from 124 crowd workers on 200 ReDial dialogues using the 18-dimensional CRS-Que framework. Using random-effects reliability models and correlation analysis, we quantify the stability of individual dimensions and their interdependencies. Our results show that utilitarian and outcome-oriented dimensions such as accuracy, usefulness, and satisfaction achieve moderate reliability under aggregation, whereas socially grounded constructs such as humanness and rapport are substantially less reliable. Furthermore, many dimensions collapse into a single global quality signal, revealing a strong halo effect in third-party judgments. These findings challenge the validity of single-annotator and LLM-based evaluation protocols and motivate the need for multi-rater aggregation and dimension reduction in offline CRS evaluation.
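To make "moderate reliability under aggregation" concrete, the sketch below computes a one-way random-effects intraclass correlation (ICC) for a single rating dimension. It then shows how averaging k raters raises reliability, using the Spearman-Brown relation. This illustrates the general technique and is not the authors' model: the abstract does not specify their random-effects formulation. The sketch assumes a balanced design (every dialogue rated by the same number of workers), which the study's data (1,053 annotations over 200 dialogues) cannot satisfy exactly. The function names and the toy ratings are hypothetical.

```python
from statistics import mean


def one_way_icc(ratings):
    """One-way random-effects ICC for a balanced design.

    ratings: list of per-dialogue lists, each holding the same number k
    of scores for one dimension (assumed layout, not the paper's format).
    Returns (single-rater ICC, ICC of the k-rater average).
    """
    n = len(ratings)
    k = len(ratings[0])
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need >= 2 dialogues, each with the same k >= 2 ratings")

    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]

    # Mean squares from a one-way ANOVA with dialogues as the grouping factor.
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    if ms_between == 0:
        raise ValueError("no variance between dialogues; ICC is undefined")

    icc_single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    icc_average = (ms_between - ms_within) / ms_between
    return icc_single, icc_average


def spearman_brown(icc_single, k):
    """Projected reliability of the mean of k raters, given single-rater ICC."""
    return k * icc_single / (1 + (k - 1) * icc_single)


# Toy data: 5 dialogues, 3 hypothetical worker scores each on a 1-5 scale.
toy = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [1, 2, 2], [3, 3, 4]]
icc1, icck = one_way_icc(toy)
print(f"single-rater ICC: {icc1:.3f}, 3-rater average ICC: {icck:.3f}")
```

In a balanced design the averaged ICC equals the Spearman-Brown projection of the single-rater ICC, so a dimension with low single-rater reliability can still become usable once enough raters are averaged. This is the rationale for multi-rater aggregation.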