Apr 22, 2026 · arXiv:2604.20763

Coverage, Not Averages: Semantic Stratification for Trustworthy Retrieval Evaluation

Andrew Klearman, Radu Revutchi, Rohin Garg, Rishav Chakravarti, Samuel Marc Denton, Yuan Xue

AI Summary

This paper formalizes retrieval evaluation for retrieval-augmented generation (RAG) as a statistical estimation problem and argues that heuristically constructed query sets bias the resulting metrics. The authors introduce semantic stratification, which organizes documents into entity-based clusters and generates queries for the clusters that an existing query set leaves uncovered. Experiments across multiple benchmarks and retrieval methods expose systematic coverage gaps and indicate that stratified evaluation gives more stable and interpretable assessments than aggregate metrics.

Key Contribution

Systematic coverage gaps in retrieval evaluations can lead to misleading assessments, but semantic stratification offers a clearer, more trustworthy framework for understanding retrieval performance.

Abstract

Retrieval quality is the primary bottleneck for accuracy and robustness in retrieval-augmented generation (RAG). Current evaluation relies on heuristically constructed query sets, which introduce a hidden intrinsic bias. We formalize retrieval evaluation as a statistical estimation problem, showing that metric reliability is fundamentally limited by the evaluation-set construction. We further introduce semantic stratification, which grounds evaluation in corpus structure by organizing documents into an interpretable global space of entity-based clusters and systematically generating queries for missing strata. This yields (1) formal semantic coverage guarantees across retrieval regimes and (2) interpretable visibility into retrieval failure modes. Experiments across multiple benchmarks and retrieval methods validate our framework. The results expose systematic coverage gaps, identify structural signals that explain variance in retrieval performance, and show that stratified evaluation yields more stable and transparent assessments while supporting more trustworthy decision-making than aggregate metrics.
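To make the contrast between an aggregate average and a stratified view concrete, here is a minimal sketch using the textbook stratified-mean estimator. It is not the paper's estimator, which the abstract does not spell out. The function name `stratified_estimate`, the stratum labels, the weights (each stratum's assumed share of the corpus), and the per-query scores are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def stratified_estimate(scores, strata, weights):
    """Weighted mean of per-stratum averages, plus a coverage report.

    scores  -- per-query metric values (e.g. recall@k), one per query
    strata  -- stratum label for each query, aligned with `scores`
    weights -- assumed mapping: stratum label -> share of the corpus
    """
    by_stratum = defaultdict(list)
    for score, label in zip(scores, strata):
        by_stratum[label].append(score)

    # Mean score inside each stratum that has at least one query.
    means = {s: sum(v) / len(v) for s, v in by_stratum.items() if s in weights}
    # Strata present in the corpus but absent from the query set.
    missing = sorted(s for s in weights if s not in means)
    covered_weight = sum(weights[s] for s in means)

    if covered_weight == 0:
        return None, 0.0, missing
    # Renormalise over covered strata only; the estimate says nothing
    # about the missing ones, which is why coverage is reported alongside.
    estimate = sum(weights[s] * m for s, m in means.items()) / covered_weight
    return estimate, covered_weight, missing


# Hypothetical query set: three queries hit stratum "a", one hits "b",
# and stratum "c" (half of the corpus by assumption) has no queries.
scores = [1.0, 1.0, 1.0, 0.0]
strata = ["a", "a", "a", "b"]
weights = {"a": 0.25, "b": 0.25, "c": 0.5}

naive_mean = sum(scores) / len(scores)
estimate, covered_weight, missing = stratified_estimate(scores, strata, weights)
print(naive_mean, estimate, covered_weight, missing)
```

In this toy case the naive mean is inflated by the over-sampled stratum, while the stratified figure weights each covered stratum by its assumed corpus share and reports that half of the corpus has no queries at all. Generating queries for such missing strata is the gap-filling step the abstract describes.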

Data Curation & Synthetic Data · Eval Frameworks & Benchmarks · Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
