MIT CSAIL · Harvard · LMU · Notre Dame · Rochester · UPenn · Jun 15, 2026 · arXiv:2606.16262

UXBench: Measuring the Actionability of LLM-Generated UX Critiques

Wenjie Wang, Yue Huang, Zipeng Ling, Han Bao, Hang Hua, Xiaonan Luo, Yu Jiang, Shiyi Du, Yuexing Hao, Xiaomin Li, Yuchen Ma, Dianzhuo Wang, Yanfang Ye, Xiangliang Zhang

AI Summary

This paper introduces UXBench, a novel benchmark designed to evaluate the actionability of UX critiques generated by large language models (LLMs) across diverse product surfaces. The study reveals significant variability in the actionability of critiques from different models, demonstrating that no single model consistently outperforms others across all dimensions and product types. Key findings indicate that models exhibit distinct strengths and weaknesses in their UX reports, which can directly influence the effectiveness of downstream interface repairs.

Key Contribution

UXBench reveals that LLMs vary dramatically in their ability to produce actionable UX critiques, challenging the assumption of uniform model performance in this domain.

Abstract

Large language models (LLMs) are increasingly deployed as UX judges that inspect interfaces, diagnose usability problems, and propose repairs. Yet no controlled benchmark measures whether the resulting critiques are reliable and actionable across heterogeneous product surfaces. We introduce UXBench, a benchmark for evaluating LLMs as interaction-grounded UX judges. UXBench comprises local-first runnable web fixtures spanning ten product-surface families, paired with coverage-gated browser exploration that forces models to collect interaction evidence before reporting. Each judge model produces a structured UX report over seven rubric dimensions; report quality is measured by whether a fixed downstream repair agent can improve the interface based on the critique. We evaluate eight frontier models under both an automated repair-lift protocol and a blind human validation study. Results show that UX judging is neither saturated nor one-dimensional: models differ meaningfully in report actionability, exhibit distinct rubric-level repair signatures, vary in fixture-level reliability, and trade leadership across surface categories.
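
The abstract names two mechanisms, coverage-gated exploration and repair lift, without giving formulas. The following Python sketch shows one possible reading of each and is not the authors' implementation. It assumes a coverage gate is a threshold on the fraction of required interaction targets that a judge exercised, and that repair lift is the mean change in an interface score after the repair agent acts on a report. The function names, the record fields (`judge`, `score_before`, `score_after`), and the 0.8 threshold are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def coverage_gate(visited, required, threshold=0.8):
    """Return True when enough required interaction targets were exercised.

    `threshold` is an illustrative value, not one taken from the paper.
    """
    required = set(required)
    if not required:
        return True
    covered = len(set(visited) & required) / len(required)
    return covered >= threshold


def repair_lift(records):
    """Mean (score_after - score_before) per judge model.

    Each record is a dict with hypothetical keys:
    'judge', 'score_before', 'score_after'.
    """
    deltas = defaultdict(list)
    for record in records:
        deltas[record["judge"]].append(
            record["score_after"] - record["score_before"]
        )
    return {judge: mean(values) for judge, values in deltas.items()}


# Toy data for illustration only; these are not results from the paper.
toy_records = [
    {"judge": "model_a", "score_before": 0.5, "score_after": 0.75},
    {"judge": "model_a", "score_before": 0.25, "score_after": 0.5},
    {"judge": "model_b", "score_before": 0.5, "score_after": 0.5},
]
lifts = repair_lift(toy_records)
```

Under this reading, a judge whose reports lead to no measurable improvement gets a lift of zero, and a report is accepted only once `coverage_gate` returns True for its exploration trace.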

Eval Frameworks & Benchmarks

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
