Charles University | Mar 10, 2026 | arXiv:2603.09403

LLM as a Meta-Judge: Synthetic Data for NLP Evaluation Metric Validation

Lukáš Eigler, Jindřich Libovický, David Hurych

AI Summary

The paper introduces "LLM as a Meta-Judge," a framework using LLMs to generate synthetic evaluation datasets for NLP tasks by applying controlled semantic degradation to real data. This approach replaces reliance on expensive human annotations for validating NLG evaluation metrics, especially in multilingual settings. Validated using "meta-correlation" against human benchmarks, the framework achieves high alignment (meta-correlations > 0.9 in multilingual QA), demonstrating its potential as a cost-effective proxy for human judgment.

Key Contribution

Forget expensive human annotations: LLMs can reliably generate synthetic data to validate NLP evaluation metrics, even outperforming human agreement in some multilingual tasks.

Abstract

Validating evaluation metrics for NLG typically relies on expensive and time-consuming human annotations, which exist predominantly for English datasets. We propose "LLM as a Meta-Judge", a scalable framework that uses LLMs to generate synthetic evaluation datasets via controlled semantic degradation of real data, replacing human judgment. We validate our approach using "meta-correlation", which measures the alignment between metric rankings derived from synthetic data and those from standard human benchmarks. Experiments across Machine Translation, Question Answering, and Summarization demonstrate that synthetic validation serves as a reliable proxy for human judgment, achieving meta-correlations exceeding 0.9 in multilingual QA, and that it is a viable alternative where human judgments are unavailable or too expensive to obtain. Our code and data will become publicly available upon paper acceptance.
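The abstract describes meta-correlation only as the alignment between metric rankings obtained from synthetic data and those obtained from human benchmarks. A minimal sketch of that idea follows, assuming the alignment is measured with Spearman rank correlation over per-metric scores; the paper's exact formulation may differ. The function names, metric names, and all numbers below are hypothetical and serve only as illustration.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def meta_correlation(synthetic_scores, human_scores):
    """Spearman correlation between two per-metric score tables.

    Assumption: each dict maps a metric name to how well that metric
    performs on the synthetic data or on the human benchmark.
    """
    metrics = sorted(synthetic_scores)
    syn = [synthetic_scores[m] for m in metrics]
    hum = [human_scores[m] for m in metrics]
    return pearson(ranks(syn), ranks(hum))


# Hypothetical per-metric scores, not results from the paper.
synthetic = {"metric_a": 0.62, "metric_b": 0.48, "metric_c": 0.71, "metric_d": 0.30}
human = {"metric_a": 0.55, "metric_b": 0.51, "metric_c": 0.66, "metric_d": 0.25}

print(meta_correlation(synthetic, human))
```

In this toy input both sources order the four metrics identically, so the value is 1.0; a value near 1 would indicate that the synthetic data ranks metrics much as the human benchmark does.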

Data Curation & Synthetic Data | Eval Frameworks & Benchmarks | Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
