The paper introduces MEQA, a meta-evaluation framework for assessing the quality of Question Answering (QA) benchmarks used for evaluating Large Language Models (LLMs). MEQA provides standardized assessments and quantifiable scores to enable comparisons between benchmarks, addressing a critical gap in the rigorous evaluation of LLMs. The framework is demonstrated on cybersecurity benchmarks using both human and LLM evaluators, revealing specific strengths and weaknesses of these benchmarks.
Stop blindly trusting benchmarks: MEQA offers a framework to rigorously evaluate the *quality* of QA benchmarks themselves, revealing hidden flaws and biases.
As Large Language Models (LLMs) advance, their potential for widespread societal impact grows in step. Rigorous LLM evaluation is therefore both a technical necessity and a social imperative. While numerous evaluation benchmarks have been developed, a critical gap remains in meta-evaluation: assessing the quality of the benchmarks themselves. We propose MEQA, a framework for the meta-evaluation of question-and-answer (QA) benchmarks, which provides standardized assessments and quantifiable scores and enables meaningful inter-benchmark comparisons. We demonstrate the approach on cybersecurity benchmarks, using both human and LLM evaluators, and highlight the benchmarks' strengths and weaknesses. Our choice of test domain is motivated by AI models' dual nature as powerful defensive tools and as security threats.
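To make the idea of "quantifiable scores enabling inter-benchmark comparison" concrete, here is a minimal sketch of how per-criterion ratings from human and LLM evaluators could be aggregated into a single comparable score per benchmark. The criterion names, the [0, 1] normalization, and the simple mean aggregation are illustrative assumptions, not the scoring method defined by the MEQA paper.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical meta-evaluation criteria; MEQA's actual criteria may differ.
CRITERIA = ["question_clarity", "answer_correctness", "domain_coverage", "difficulty_spread"]

@dataclass
class CriterionRating:
    criterion: str   # one of CRITERIA
    score: float     # assumed normalized to [0, 1]
    evaluator: str   # e.g. "human" or "llm"

def benchmark_score(ratings: list[CriterionRating]) -> dict[str, float]:
    """Aggregate ratings into one score per criterion plus an overall mean."""
    per_criterion = {
        c: mean(r.score for r in ratings if r.criterion == c)
        for c in CRITERIA
        if any(r.criterion == c for r in ratings)
    }
    per_criterion["overall"] = mean(per_criterion.values())
    return per_criterion

# Example: scoring one (hypothetical) cybersecurity QA benchmark
# with one human and one LLM rating per criterion.
ratings = [
    CriterionRating("question_clarity", 0.9, "human"),
    CriterionRating("question_clarity", 0.8, "llm"),
    CriterionRating("answer_correctness", 0.7, "human"),
    CriterionRating("answer_correctness", 0.6, "llm"),
]
print(benchmark_score(ratings))
```

Under this assumed scheme, running the same criteria over several benchmarks yields scores on a common scale, which is what makes side-by-side comparison between benchmarks meaningful.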