Tsinghua AI · Hefei Comprehensive National Science · HFUT · USTB · Jun 1, 2026 · arXiv:2606.01629

Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation

Junjie Chen, Yuxi Dong, Haitao Li, Weihang Su, Yujia Zhou, Min Zhang, Yiqun Liu, Qinyao Ai

AI Summary

This paper introduces LongJudgeBench, a novel benchmark designed to evaluate the effectiveness of large language models (LLMs) as judges for long-form outputs, addressing a critical gap in existing evaluation frameworks that predominantly focus on short-form content. The authors systematically assess various LLM judges across multiple base models and judging protocols, revealing a significant reliability gap in their performance when tasked with complex document-level evaluations. The findings indicate that while rubrics and references can enhance evaluation stability, they are not universally effective, highlighting the need for more robust and context-aware LLM judging methods.

Key Contribution

Current LLM judges show a substantial reliability gap in long-form evaluation: they remain unstable across scenarios, and rubrics or references help but are not always sufficient.

Abstract

As large language models (LLMs) are increasingly used for long-form generation, reliably evaluating long-form outputs has become a critical challenge. LLM-as-a-judge offers a scalable alternative to human evaluation, yet its reliability in long-form output evaluation remains underexamined: existing meta-evaluation benchmarks focus mainly on short-form outputs. Compared with short-form evaluation, long-form evaluation is not merely a matter of output length; it often requires judges to handle more complex document-level demands. In this work, we introduce LongJudgeBench, a comprehensive benchmark for evaluating LLM judges on long-form outputs across diverse real-world scenarios and judging protocols. We systematically evaluate a broad range of LLM judges, covering multiple base models and judging settings. Our results reveal a substantial reliability gap: current LLM judges remain unstable across scenarios, and rubrics or references are helpful but not always sufficient. We hope LongJudgeBench will support future research on more robust, context-aware, and human-aligned LLM-as-a-judge methods. Our code is available at https://anonymous.4open.science/r/LongJudgeBench-F782.
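To make the notion of judge stability across scenarios concrete, the following is a minimal sketch of one way a meta-evaluation could be scored. It is an illustration under stated assumptions, not the LongJudgeBench protocol: the record layout, the function names `agreement_by_scenario` and `stability`, the use of exact label agreement with human preferences, and the toy data are all hypothetical. The abstract does not specify the benchmark's actual metrics.

```python
from collections import defaultdict
from statistics import mean, pstdev


def agreement_by_scenario(records):
    """Fraction of items where the judge label matches the human label, per scenario.

    `records` is an iterable of (scenario, human_label, judge_label) tuples.
    This layout is an assumption made for illustration only.
    """
    hits = defaultdict(list)
    for scenario, human, judge in records:
        hits[scenario].append(1.0 if human == judge else 0.0)
    return {scenario: mean(values) for scenario, values in hits.items()}


def stability(per_scenario):
    """Mean agreement and its spread (population std. dev.) across scenarios."""
    scores = list(per_scenario.values())
    return mean(scores), pstdev(scores)


# Toy data (hypothetical): pairwise preferences "A" or "B" in two scenarios.
records = [
    ("report", "A", "A"),
    ("report", "B", "A"),
    ("story", "A", "A"),
    ("story", "B", "B"),
]

per_scenario = agreement_by_scenario(records)
avg, spread = stability(per_scenario)
print(per_scenario, avg, spread)
```

In this sketch, a judge with a high average but a large spread would agree with humans overall yet behave inconsistently from one scenario to another, which is the kind of instability the abstract describes.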

Eval Frameworks & Benchmarks · Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
