XpertBench is introduced as a new benchmark to evaluate LLMs on complex, open-ended tasks across 80 professional domains, using 1,346 tasks derived from expert submissions. To address evaluation biases, the authors introduce ShotJudge, an LLM-based evaluation paradigm calibrated with expert few-shot examples. Evaluation of state-of-the-art LLMs on XpertBench reveals a performance ceiling, with a peak success rate of only ~66%, highlighting a significant "expert-gap" in current AI systems.
LLMs still fail to demonstrate expert-level proficiency, achieving only ~66% success on a new benchmark of real-world professional tasks spanning finance, healthcare, and law.
As Large Language Models (LLMs) exhibit plateauing performance on conventional benchmarks, a pivotal challenge persists: evaluating their proficiency in the complex, open-ended tasks that characterize genuine expert-level cognition. Existing frameworks suffer from narrow domain coverage, reliance on generalist tasks, or self-evaluation biases. To bridge this gap, we present XpertBench, a high-fidelity benchmark engineered to assess LLMs across authentic professional domains. XpertBench consists of 1,346 meticulously curated tasks across 80 categories, spanning finance, healthcare, legal services, education, and dual-track research (STEM and Humanities). These tasks are derived from over 1,000 submissions by domain experts--including researchers from elite institutions and practitioners with extensive clinical or industrial experience--ensuring superior ecological validity. Each task uses a detailed rubric, typically with 15-40 weighted checkpoints, to assess professional rigor. To facilitate scalable yet human-aligned assessment, we introduce ShotJudge, a novel evaluation paradigm that employs LLM judges calibrated with expert few-shot exemplars to mitigate self-rewarding biases. Our empirical evaluation of state-of-the-art LLMs reveals a pronounced performance ceiling: even leading models achieve a peak success rate of only ~66%, with a mean score around 55%. Models also exhibit domain-specific divergence, showing non-overlapping strengths in quantitative reasoning versus linguistic synthesis. These findings underscore a significant "expert-gap" in current AI systems and establish XpertBench as a critical instrument for navigating the transition from general-purpose assistants to specialized professional collaborators.
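To make the rubric-and-judge setup more concrete, below is a minimal illustrative sketch of how weighted-checkpoint scoring and few-shot-calibrated judging could be wired together. This is not the paper's released code: the `Checkpoint`, `Exemplar`, and `build_judge_prompt` names, the prompt wording, and the assumption that calibration exemplars are prior expert-graded responses to the same rubric are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Checkpoint:
    """One weighted rubric item, e.g. 'cites the governing statute correctly'."""
    description: str
    weight: float

@dataclass
class Exemplar:
    """An expert-graded example response used to calibrate the LLM judge (few-shot)."""
    task: str
    response: str
    verdicts: List[bool]  # expert pass/fail per checkpoint, same order as the rubric

def build_judge_prompt(task: str, response: str,
                       checkpoints: List[Checkpoint],
                       exemplars: List[Exemplar]) -> str:
    """Assemble a judging prompt that prepends expert-graded exemplars before the
    candidate response, so the judge grades relative to expert calibration rather
    than in a vacuum."""
    parts = ["You are grading a professional task against a weighted rubric."]
    for ex in exemplars:
        graded = "\n".join(
            f"- {cp.description}: {'PASS' if ok else 'FAIL'}"
            for cp, ok in zip(checkpoints, ex.verdicts)
        )
        parts.append(
            f"Example task:\n{ex.task}\nExample response:\n{ex.response}\n"
            f"Expert verdicts:\n{graded}"
        )
    rubric = "\n".join(f"- {cp.description} (weight {cp.weight})" for cp in checkpoints)
    parts.append(
        f"Task:\n{task}\nCandidate response:\n{response}\nRubric:\n{rubric}\n"
        "Return PASS or FAIL for each checkpoint, one per line."
    )
    return "\n\n".join(parts)

def score(verdicts: List[bool], checkpoints: List[Checkpoint]) -> float:
    """Weighted fraction of rubric checkpoints the candidate response satisfies."""
    total = sum(cp.weight for cp in checkpoints)
    earned = sum(cp.weight for cp, ok in zip(checkpoints, verdicts) if ok)
    return earned / total if total else 0.0
```

The intent of the exemplar block is that the judge sees how experts graded comparable responses against the same rubric before grading the candidate, which is the calibration idea the few-shot exemplars in ShotJudge are described as providing; the final score is simply the share of rubric weight earned.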