The authors introduce a new benchmark, the Enterprise Large Language Model Evaluation Benchmark, comprising 14 enterprise-specific tasks categorized by Bloom's Taxonomy, to address the limitations of existing benchmarks like MMLU in evaluating LLMs for enterprise applications. They develop a scalable data curation pipeline using LLM-as-a-Labeler, LLM-as-a-Judge, and corrective retrieval-augmented generation (CRAG) to create a 9,700-sample dataset. Evaluation of six leading models reveals that open-source models like DeepSeek R1 perform competitively in reasoning tasks but underperform in judgment-based tasks compared to proprietary models.
Open-source LLMs can rival proprietary models on enterprise reasoning tasks, but watch out: they struggle with judgment calls, possibly due to overthinking.
Large Language Models (LLMs) have demonstrated promise in boosting productivity across AI-powered tools, yet existing benchmarks like Massive Multitask Language Understanding (MMLU) inadequately capture the complexity of enterprise-specific tasks. We propose a 14-task framework grounded in Bloom's Taxonomy to holistically evaluate LLM capabilities in enterprise contexts. To address the challenges of noisy data and costly annotation, we develop a scalable pipeline combining LLM-as-a-Labeler, LLM-as-a-Judge, and corrective retrieval-augmented generation (CRAG), curating a robust 9,700-sample benchmark. Evaluation of six leading models shows that open-source contenders like DeepSeek R1 rival proprietary models in reasoning tasks but lag in judgment-based scenarios, likely due to overthinking. Our benchmark reveals critical enterprise performance gaps and offers actionable insights for model optimization. This work provides enterprises with a blueprint for tailored evaluations and advances practical LLM deployment.
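To make the curation pipeline concrete, here is a minimal sketch of the three-stage loop the abstract describes: a Labeler model drafts an answer, a Judge model scores it, and low-scoring samples are routed through a corrective RAG step before re-judging. This is an illustration under assumptions, not the paper's implementation; the function names (`call_llm`, `label_sample`, `judge_label`, `crag_correct`), the scoring prompt, and the 0.8 acceptance threshold are all hypothetical.

```python
# Illustrative sketch of the Labeler -> Judge -> CRAG curation loop.
# All names, prompts, and thresholds here are assumptions, not taken
# from the paper.

from dataclasses import dataclass

@dataclass
class Sample:
    question: str
    label: str = ""
    judge_score: float = 0.0

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API; wire in a real client."""
    raise NotImplementedError("connect your LLM provider here")

def label_sample(s: Sample) -> Sample:
    # Stage 1: LLM-as-a-Labeler drafts a candidate label.
    s.label = call_llm(f"Answer this enterprise task:\n{s.question}")
    return s

def judge_label(s: Sample) -> Sample:
    # Stage 2: LLM-as-a-Judge rates the label on a 0-to-1 scale.
    verdict = call_llm(
        "Rate the answer's correctness from 0 to 1.\n"
        f"Q: {s.question}\nA: {s.label}\nScore:"
    )
    s.judge_score = float(verdict.strip())
    return s

def crag_correct(s: Sample, retrieve) -> Sample:
    # Stage 3: corrective RAG; ground a rewrite in retrieved evidence.
    evidence = retrieve(s.question)
    s.label = call_llm(
        "Using this evidence, correct the answer.\n"
        f"Evidence: {evidence}\nQ: {s.question}\nDraft: {s.label}"
    )
    return s

def curate(samples, retrieve, threshold=0.8):
    """Keep judge-approved samples; route the rest through CRAG once."""
    curated = []
    for s in samples:
        s = judge_label(label_sample(s))
        if s.judge_score < threshold:
            s = judge_label(crag_correct(s, retrieve))
        if s.judge_score >= threshold:
            curated.append(s)
    return curated
```

In this reading, CRAG acts as a single corrective pass rather than an open-ended loop, which keeps labeling costs bounded while still recovering samples the Judge initially rejects.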