Jun 8, 2026 | arXiv:2606.09570

UXBench: Benchmarking User Experience in AI Assistants

Mengze Hong, Xia Zeng, Zeyang Lei, Sheng Wang, Chen Jason Zhang, Di Jiang, Taiming Fu, Jinfeng Huang, Mengqiao Liu, Qinghe Chang, Haosheng Zou, Qiongyi Zhou, Sijun He, Chen Xiaoshuai, Simon Deng, Haojing Huang, Zijian Li, Lucas Mu Li, Fubao Zhang, Mona Zhou, Wei Ma, Chenxuan Ma, Yuanmeng Zhang, Jian Song, Minlong Peng, Di Liang, Davey Chen

AI Summary

This paper introduces UXBench, described by the authors as the first user-centric benchmark grounded in real user feedback, which addresses the need to assess user experience (UX) beyond general model capability. The benchmark comprises three tasks, UX Judge, UX Eval, and UX Recovery, and uses 7,400 test instances derived from over 70,000 interaction logs, covering diverse user scenarios and failure patterns. Experiments on 26 frontier language models show that user feedback prediction is a learnable capability, expose performance gaps between models, and document systematic biases in LLM-as-a-judge evaluation protocols. The authors call for a shift toward user-focused optimization in AI assistant development.

Key Contribution

User feedback prediction is a learnable capability, and evaluating it exposes performance gaps in AI assistants that traditional evaluation methods overlook.

Abstract

As AI assistants serve millions of users daily, evaluating user experience (UX) beyond general model capability has become increasingly important. We present UXBench, the first user-centric benchmark grounded in real user feedback signals for evaluating preference alignment and dialogue generation. The benchmark consists of three interconnected tasks, UX Judge, UX Eval, and UX Recovery, with 7,400 test instances extracted from over 70K interaction logs of a mainstream Chinese AI assistant. The dataset closely reflects real user distributions, covering 8 scenarios, 83 domains, and diverse failure patterns that pose severe challenges. Extensive experiments on 26 frontier language models provide novel insights into how well models perceive user experience and how improvements in model capability contribute to better dialogue engagement. Through comprehensive analysis of model behavior and performance gaps, we show that user feedback prediction is a learnable capability, where a reward model trained from in-the-wild feedback signals can achieve well-calibrated accuracy. We further document the systematic biases of LLM-as-a-judge evaluation protocols and compare typical response strategies that directly affect user experience. UXBench establishes a new evaluation landscape and calls for greater attention to tailored UX optimization, contributing to a user-centric scaling law that shapes the success of AI assistants.

Eval Frameworks & Benchmarks | RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
