The paper introduces UpBench, a dynamically evolving benchmark for evaluating LLM agents in real-world labor market scenarios using tasks sourced from Upwork. UpBench grounds evaluation in verified client transactions and financial outcomes, offering a more realistic assessment than static or synthetic benchmarks. The framework employs rubric-based evaluation with expert freelancer feedback, enabling fine-grained analysis of agent performance and instruction-following fidelity.
Forget static datasets: UpBench grounds agent evaluation in the messy reality of the Upwork labor market, complete with financial incentives and expert human feedback.
As large language model (LLM) agents increasingly undertake digital work, reliable frameworks are needed to evaluate their real-world competence, adaptability, and capacity for human collaboration. Existing benchmarks remain largely static, synthetic, or domain-limited, providing little insight into how agents perform in dynamic, economically meaningful environments. We introduce UpBench, a dynamically evolving benchmark grounded in real jobs drawn from the global Upwork labor marketplace. Each task corresponds to a verified client transaction, anchoring evaluation in genuine work activity and financial outcomes. UpBench employs a rubric-based evaluation framework in which expert freelancers decompose each job into detailed, verifiable acceptance criteria and assess AI submissions with per-criterion feedback. This structure enables fine-grained analysis of model strengths, weaknesses, and instruction-following fidelity beyond binary pass/fail metrics. Human expertise is integrated throughout the data pipeline (from job curation and rubric construction to evaluation), ensuring fidelity to real professional standards and supporting research on human-AI collaboration. By regularly refreshing tasks to reflect the evolving nature of online work, UpBench provides a scalable, human-centered foundation for evaluating agentic systems in authentic labor-market contexts, offering a path toward a collaborative framework where AI amplifies human capability through partnership rather than replacement.
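The per-criterion rubric structure described in the abstract can be sketched as a simple data model. This is an illustrative sketch only: the class names, fields, and scoring logic below are assumptions for exposition, not UpBench's actual schema or evaluation code.

```python
# Hypothetical sketch of a rubric-based, per-criterion evaluation record.
# Names and fields are illustrative, not drawn from UpBench itself.
from dataclasses import dataclass, field


@dataclass
class Criterion:
    """One verifiable acceptance criterion with expert judgment."""
    description: str
    passed: bool
    feedback: str = ""  # per-criterion freelancer feedback


@dataclass
class RubricEvaluation:
    """Evaluation of one AI submission against a job's rubric."""
    job_id: str
    criteria: list[Criterion] = field(default_factory=list)

    def pass_rate(self) -> float:
        # Fine-grained score instead of a binary pass/fail verdict.
        if not self.criteria:
            return 0.0
        return sum(c.passed for c in self.criteria) / len(self.criteria)

    def failed_criteria(self) -> list[Criterion]:
        # Surfaces exactly where the submission fell short.
        return [c for c in self.criteria if not c.passed]


# Example: a job decomposed into three acceptance criteria.
ev = RubricEvaluation(
    job_id="job-001",
    criteria=[
        Criterion("Deliverable matches the requested file format", True),
        Criterion("All client-specified sections are present", True),
        Criterion("Tone follows the client's style guide", False,
                  feedback="Too informal in the summary section"),
    ],
)
print(f"pass rate: {ev.pass_rate():.2f}")  # pass rate: 0.67
```

Aggregating such records across a refreshed task pool would let one track instruction-following fidelity per criterion category rather than only overall success.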