OccuBench is a benchmark that evaluates AI agents on 100 real-world professional task scenarios spanning diverse industries and domains, using Language World Models (LWMs) to simulate the environments. It includes a multi-agent synthesis pipeline that generates evaluation instances with guaranteed solvability and calibrated difficulty. An evaluation of 15 frontier models shows that occupational capability profiles differ from model to model, that implicit faults are especially hard to handle, and that larger models and higher reasoning effort improve performance. It also shows that simulator quality matters for reliable LWM-based evaluation.
No single AI model dominates across all professional industries, revealing distinct occupational capability profiles and highlighting the need for specialized AI development.
AI agents are expected to perform professional work across hundreds of occupational domains (from emergency department triage to nuclear reactor safety monitoring to customs import processing), yet existing benchmarks can only evaluate agents in the few domains where public environments exist. We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language World Models (LWMs) that simulate domain-specific environments through LLM-driven tool response generation. Our multi-agent synthesis pipeline automatically produces evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity. OccuBench evaluates agents along two complementary dimensions: task completion across professional domains and environmental robustness under controlled fault injection (explicit errors, implicit data degradation, and mixed faults). We evaluate 15 frontier models across 8 model families and find that: (1) no single model dominates all industries, as each has a distinct occupational capability profile; (2) implicit faults (truncated data, missing fields) are harder than both explicit errors (timeouts, 500s) and mixed faults, because they lack overt error signals and require the agent to independently detect data degradation; (3) larger models, newer generations, and higher reasoning effort consistently improve performance, with GPT-5.2 improving by 27.5 points from minimal to maximum reasoning effort; and (4) strong agents are not necessarily strong environment simulators, and simulator quality is critical for LWM-based evaluation reliability. OccuBench provides the first systematic cross-industry evaluation of AI agents on professional occupational tasks.
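To make the three fault modes named above concrete, here is a minimal Python sketch of fault injection around a simulated tool. It is an illustration under stated assumptions, not the OccuBench implementation: the names `with_faults`, `lookup_shipment`, and `rate`, the response shape, and the example field names are all hypothetical, and a plain function stands in for the LWM-generated tool response.

```python
import copy
import random
from typing import Any, Callable, Dict

# Hypothetical tool signature: arguments in, JSON-like response out.
ToolFn = Callable[[Dict[str, Any]], Dict[str, Any]]


def lookup_shipment(args: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for a simulated customs tool (all field names are invented)."""
    return {
        "status": 200,
        "data": {"id": args["id"], "hs_code": "8471.30", "declared_value": 1200},
    }


def explicit_fault() -> Dict[str, Any]:
    """Overt failure: the agent receives a clear error signal."""
    return {"status": 500, "error": "Internal Server Error"}


def implicit_fault(response: Dict[str, Any], rng: random.Random) -> Dict[str, Any]:
    """Silent degradation: status still looks healthy, but one field is missing."""
    degraded = copy.deepcopy(response)  # never mutate the clean response
    data = degraded.get("data")
    if isinstance(data, dict) and data:
        del data[rng.choice(sorted(data))]
    return degraded


def with_faults(tool: ToolFn, mode: str, rate: float, seed: int = 0) -> ToolFn:
    """Wrap a tool so a fraction `rate` of calls is faulted.

    mode is one of "none", "explicit", "implicit", or "mixed".
    """
    rng = random.Random(seed)  # seeded so a run can be reproduced

    def wrapped(args: Dict[str, Any]) -> Dict[str, Any]:
        if mode == "none" or rng.random() >= rate:
            return tool(args)
        kind = rng.choice(["explicit", "implicit"]) if mode == "mixed" else mode
        if kind == "explicit":
            return explicit_fault()
        return implicit_fault(tool(args), rng)

    return wrapped


if __name__ == "__main__":
    for mode in ("none", "explicit", "implicit", "mixed"):
        faulty = with_faults(lookup_shipment, mode, rate=1.0, seed=1)
        print(mode, faulty({"id": "S1"}))
```

In this sketch the implicit case returns a response whose status is still 200, so an agent can only notice the problem by checking the payload against what it expected. That matches the abstract's explanation of why implicit faults are harder than explicit ones.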