The paper introduces UpBench, a dynamically evolving benchmark for evaluating LLM agents in real-world labor market scenarios using tasks sourced from Upwork. UpBench grounds evaluation in verified client transactions and financial outcomes, offering a more realistic assessment than static or synthetic benchmarks. The framework employs rubric-based evaluation with expert freelancer feedback, enabling fine-grained analysis of agent performance and instruction-following fidelity.
Forget static datasets: UpBench grounds agent evaluation in the messy reality of the Upwork labor market, complete with financial incentives and expert human feedback.
As large language model (LLM) agents increasingly undertake digital work, reliable frameworks are needed to evaluate their real-world competence, adaptability, and capacity for human collaboration. Existing benchmarks remain largely static, synthetic, or domain-limited, providing little insight into how agents perform in dynamic, economically meaningful environments. We introduce UpBench, a dynamically evolving benchmark grounded in real jobs drawn from the global Upwork labor marketplace. Each task corresponds to a verified client transaction, anchoring evaluation in genuine work activity and financial outcomes. UpBench employs a rubric-based evaluation framework in which expert freelancers decompose each job into detailed, verifiable acceptance criteria and assess AI submissions with per-criterion feedback. This structure enables fine-grained analysis of model strengths, weaknesses, and instruction-following fidelity beyond binary pass/fail metrics. Human expertise is integrated throughout the data pipeline (from job curation and rubric construction to evaluation), ensuring fidelity to real professional standards and supporting research on human-AI collaboration. By regularly refreshing tasks to reflect the evolving nature of online work, UpBench provides a scalable, human-centered foundation for evaluating agentic systems in authentic labor-market contexts, offering a path toward a collaborative framework where AI amplifies human capability through partnership rather than replacement.
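The per-criterion rubric structure described in the abstract can be sketched as a simple data model. This is an illustrative sketch only: the class names, fields, and scoring logic below are assumptions for exposition, not UpBench's actual schema or evaluation code.

```python
# Hypothetical sketch of a rubric-based, per-criterion evaluation record.
# Names and fields are illustrative, not drawn from UpBench itself.
from dataclasses import dataclass, field


@dataclass
class Criterion:
    """One verifiable acceptance criterion with expert judgment."""
    description: str
    passed: bool
    feedback: str = ""  # per-criterion freelancer feedback


@dataclass
class RubricEvaluation:
    """Evaluation of one AI submission against a job's rubric."""
    job_id: str
    criteria: list[Criterion] = field(default_factory=list)

    def pass_rate(self) -> float:
        # Fine-grained score instead of a binary pass/fail verdict.
        if not self.criteria:
            return 0.0
        return sum(c.passed for c in self.criteria) / len(self.criteria)

    def failed_criteria(self) -> list[Criterion]:
        # Surfaces exactly where the submission fell short.
        return [c for c in self.criteria if not c.passed]


# Example: a job decomposed into three acceptance criteria.
ev = RubricEvaluation(
    job_id="job-001",
    criteria=[
        Criterion("Deliverable matches the requested file format", True),
        Criterion("All client-specified sections are present", True),
        Criterion("Tone follows the client's style guide", False,
                  feedback="Too informal in the summary section"),
    ],
)
print(f"pass rate: {ev.pass_rate():.2f}")  # pass rate: 0.67
```

Aggregating such records across a refreshed task pool would let one track instruction-following fidelity per criterion category rather than only overall success.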