Mar 13, 2026arXiv:2603.13594

EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings

Shiva Krishna Reddy Malay, Shravan Nayak, Jishnu Sethumadhavan Nair, Sagar Davasam, Aman Tiwari, Sathwik Tejaswi Madhusudhan, S. Nemala, Srinivas Sunkara, Sai Rajeswar

AI Summary

The paper introduces EnterpriseOps-Gym, a new benchmark for evaluating agentic planning in realistic enterprise environments, featuring a containerized sandbox with databases and functional tools. They evaluated 14 frontier models on 1,150 expert-curated tasks across eight enterprise verticals, finding that even the best model (Claude Opus 4.5) achieves only 37.4% success, highlighting limitations in strategic reasoning and task refusal. Providing oracle human plans significantly improves performance, suggesting strategic reasoning is a key bottleneck.

Key Contribution

Current LLM agents are nowhere near ready for autonomous enterprise deployment, with even the best models failing at strategic reasoning and often attempting infeasible tasks with potentially harmful consequences.

Abstract

Large language models are shifting from passive information providers to active agents intended for complex workflows. However, their deployment as reliable AI workers in enterprise is stalled by benchmarks that fail to capture the intricacies of professional environments, specifically, the need for long-horizon planning amidst persistent state changes and strict access protocols. In this work, we introduce EnterpriseOps-Gym, a benchmark designed to evaluate agentic planning in realistic enterprise settings. Specifically, EnterpriseOps-Gym features a containerized sandbox with 164 database tables and 512 functional tools to mimic real-world search friction. Within this environment, agents are evaluated on 1,150 expert-curated tasks across eight mission-critical verticals (including Customer Service, HR, and IT). Our evaluation of 14 frontier models reveals critical limitations in state-of-the-art models: the top-performing Claude Opus 4.5 achieves only 37.4% success. Further analysis shows that providing oracle human plans improves performance by 14-35 percentage points, pinpointing strategic reasoning as the primary bottleneck. Additionally, agents frequently fail to refuse infeasible tasks (best model achieves 53.9%), leading to unintended and potentially harmful side effects. Our findings underscore that current agents are not yet ready for autonomous enterprise deployment. More broadly, EnterpriseOps-Gym provides a concrete testbed to advance the robustness of agentic planning in professional workflows.

Eval Frameworks & Benchmarks Tool Use & Agents World Models & Planning

Citation Metrics

Citations0

Influential citations0

References29

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings

Related Papers