Passau · Apr 30, 2026 · arXiv:2604.27878

SimEval-IR: A Unified Toolkit and Benchmark Suite for Evaluating User Simulators and Search Sessions

AI Summary

SimEval-IR is introduced as a toolkit and benchmark suite for evaluating user simulators in interactive information retrieval, distinguishing between behavioral realism and tester reliability. The toolkit includes a canonical session schema, validated dataset adapters, and executable benchmarks for assessing both realism and reliability. The key finding is that the common "human-likeness" classifier check is a poor predictor of system-ranking validity, while click-depth distance and Fréchet distance over session embeddings are much stronger indicators.

Key Contribution

The standard "human-likeness" classifier test for user simulators has essentially no pooled predictive power for whether they produce valid system rankings.

Abstract

User simulators are increasingly central to interactive information retrieval, yet the community lacks standardized evaluation tools. Simulators serve two objectives, behavioral realism (matching real user behavior) and tester reliability (producing valid system rankings), and these are often conflated despite being distinct and sometimes conflicting. We present SimEval-IR, an open-source toolkit and benchmark suite that makes this distinction measurable. SimEval-IR provides: (1) a canonical session schema unifying session search and conversational interactions, with validated dataset adapters and explicit loss accounting; (2) three executable benchmarks covering behavioral realism, tester reliability with RATE-style estimation, and an analysis linking the two; and (3) baseline results across four real datasets in two languages and four simulator families. Our key finding: the classifier-discriminator "human-likeness" check, the dominant realism test in the literature, has essentially no pooled predictive power for system-ranking validity ($r = +0.09$, $n = 48$), while marginal click-depth distance and Fréchet distance over session embeddings give a much stronger signal ($|r| = 0.43$ and $0.40$, $p \leq 0.005$). SimEval-IR is released with all configurations and scripts to reproduce the reported analysis.
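To make the Fréchet-distance realism signal concrete, here is a minimal sketch of one common way to compute it: fit a Gaussian to each set of session embeddings and take the squared Fréchet distance between the two Gaussians, as FID does. The abstract does not say how SimEval-IR computes this metric or which embedding model it uses, so the function name `frechet_distance`, the Gaussian assumption, and the synthetic embeddings below are illustrative assumptions, not the toolkit's API.

```python
import numpy as np


def frechet_distance(real_emb: np.ndarray, sim_emb: np.ndarray) -> float:
    """Squared Frechet distance between Gaussians fitted to two embedding sets.

    Hypothetical helper (not part of SimEval-IR). Each input has shape
    (n_sessions, dim): one embedding vector per search session.
    """
    mu_r, mu_s = real_emb.mean(axis=0), sim_emb.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(real_emb, rowvar=False))
    cov_s = np.atleast_2d(np.cov(sim_emb, rowvar=False))

    # tr((cov_r cov_s)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_s, which are real and non-negative for
    # positive semi-definite inputs; clip tiny negative numerical noise.
    eigvals = np.linalg.eigvals(cov_r @ cov_s).real
    tr_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()

    diff = mu_r - mu_s
    d2 = diff @ diff + np.trace(cov_r) + np.trace(cov_s) - 2.0 * tr_sqrt
    return float(max(d2, 0.0))


# Synthetic stand-ins for real and simulated session embeddings (assumed data).
rng = np.random.default_rng(0)
real_sessions = rng.normal(size=(200, 4))
simulated_sessions = real_sessions + 3.0  # same spread, shifted mean

print(frechet_distance(real_sessions, real_sessions))
print(frechet_distance(real_sessions, simulated_sessions))
```

A smaller value means the simulated sessions sit closer to the real ones in embedding space. In the setting the abstract describes, a score like this would be computed per simulator and dataset and then correlated with a system-ranking validity measure.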

Eval Frameworks & Benchmarks · Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 28

Year: 2026

Venue: N/A
