The paper introduces ConvApparel, a dataset of human-AI conversations collected with a dual-agent protocol that uses both "good" and "bad" recommenders, designed to address the realism gap in LLM-based user simulators for conversational recommenders. The authors propose a validation framework that combines statistical alignment, a human-likeness score, and counterfactual validation to evaluate simulator generalization. Experiments with this framework reveal a significant realism gap across all simulators, but also show that data-driven simulators outperform a prompted baseline, especially in counterfactual validation.
User simulators trained on existing datasets may be optimizing for unrealistic scenarios, as a new benchmark reveals a significant "realism gap" in their ability to generalize to diverse recommender behaviors.
The promise of LLM-based user simulators to improve conversational AI is hindered by a critical "realism gap," leading to systems that are optimized for simulated interactions, but may fail to perform well in the real world. We introduce ConvApparel, a new dataset of human-AI conversations designed to address this gap. Its unique dual-agent data collection protocol -- using both "good" and "bad" recommenders -- enables counterfactual validation by capturing a wide spectrum of user experiences, enriched with first-person annotations of user satisfaction. We propose a comprehensive validation framework that combines statistical alignment, a human-likeness score, and counterfactual validation to test for generalization. Our experiments reveal a significant realism gap across all simulators. However, the framework also shows that data-driven simulators outperform a prompted baseline, particularly in counterfactual validation where they adapt more realistically to unseen behaviors, suggesting they embody more robust, if imperfect, user models.
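To make two of the framework's components more concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not specify which statistics or scores are used, so Jensen-Shannon divergence stands in as one plausible measure of statistical alignment, and a comparison of mean satisfaction shifts stands in for counterfactual validation. The function names `js_divergence` and `counterfactual_gap`, the label inputs, and the numeric satisfaction ratings are all hypothetical.

```python
import math
from collections import Counter
from statistics import mean


def js_divergence(human_labels, sim_labels):
    """Jensen-Shannon divergence (base 2, in [0, 1]) between two label samples.

    Assumption: each sample is a list of categorical labels (for example,
    one dialogue-act label per user turn). 0 means identical distributions.
    """
    p_counts, q_counts = Counter(human_labels), Counter(sim_labels)
    support = set(p_counts) | set(q_counts)
    p = {k: p_counts[k] / len(human_labels) for k in support}
    q = {k: q_counts[k] / len(sim_labels) for k in support}
    # Mixture distribution used as the reference for both KL terms.
    m = {k: 0.5 * (p[k] + q[k]) for k in support}

    def kl(a, b):
        # Terms with a[k] == 0 contribute nothing to the KL divergence.
        return sum(a[k] * math.log2(a[k] / b[k]) for k in support if a[k] > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def counterfactual_gap(human_good, human_bad, sim_good, sim_bad):
    """Absolute difference between the human and simulated satisfaction shift.

    Assumption: each argument is a list of numeric satisfaction ratings from
    conversations with the "good" or "bad" recommender. A simulator that
    reacts to the worse recommender the way humans do yields a gap near 0.
    """
    human_shift = mean(human_good) - mean(human_bad)
    sim_shift = mean(sim_good) - mean(sim_bad)
    return abs(human_shift - sim_shift)
```

Under these assumptions, a simulator would count as more realistic when both quantities are small: its turn-level label distribution sits close to the human one, and its satisfaction drops by about as much as human satisfaction does when the recommender gets worse.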