Department of Computer Science, Emory · Jun 11, 2026 · arXiv:2606.12789

How Fine-Grained Should a RAG Benchmark Be? A Hierarchical Framework for Synthetic Question Generation

Chase M. Fensore, Kaustubh D. Dhole, Jason Fan, Eugene Agichtein, Joyce C. Ho

AI Summary

This paper introduces HieraRAG, a hierarchical framework for studying how fine-grained a benchmark for retrieval-augmented generation (RAG) systems should be. Using 5,872 synthetic question-answer pairs generated across three dimensions at three granularity levels, the study finds that, in the single BM25+Falcon-3-10B configuration tested, question complexity benefits from fine-grained distinctions, while answer type and linguistic variation are most discriminative at medium granularity. A new Coherence Ratio metric quantifies whether fine-grained splits cleanly subdivide their parent categories, giving practitioners a way to validate granularity choices in their own RAG evaluations.

Key Contribution

Optimal granularity in RAG benchmarks varies by dimension: question complexity benefits from fine-grained distinctions, while answer type and linguistic variation peak at medium granularity.

Abstract

Evaluating retrieval-augmented generation (RAG) systems requires benchmarks that capture diverse question characteristics, yet practitioners lack empirical guidance on which dimensions to vary and at what granularity. We present HieraRAG, a hierarchical framework for studying granularity in RAG benchmark construction, defining optimal granularity as the level that maximizes discriminative power (the standard deviation of generation quality across categories) within a given RAG configuration. As a case study, we generate 5,872 synthetic question-answer (QA) pairs from FineWeb-10BT across 3 dimensions (Question Complexity, Answer Type, Linguistic Variation) at 3 granularity levels (2, 4, and 8 categories). With a BM25+Falcon-3-10B pipeline, optimal granularity varies by dimension: complexity benefits from fine-grained distinctions (discriminative power: 0.053) while answer type and linguistic variation peak at medium granularity. We introduce a Coherence Ratio metric to quantify whether fine-grained splits cleanly subdivide parent categories, revealing structural differences across dimensions (Question Complexity: 0.40 vs. Answer Type: 1.44). Human evaluation of 110 stratified QA pairs confirms synthetic quality. While these specific findings reflect a single configuration, HieraRAG provides a portable procedure and validation metric for practitioners to determine evaluation granularity within their own RAG settings.
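The abstract defines discriminative power as the standard deviation of generation quality across categories, and optimal granularity as the level that maximizes it. A minimal sketch of that selection rule follows, under stated assumptions: the function names, the category labels, and every score are hypothetical and for illustration only, and the use of the population standard deviation (rather than the sample one) is an assumption, since the abstract does not say which the authors use.

```python
from statistics import pstdev


def discriminative_power(category_scores):
    """Standard deviation of mean generation quality across categories.

    Assumption: population standard deviation; the abstract does not
    specify population vs. sample.
    """
    return pstdev(category_scores.values())


def optimal_granularity(levels):
    """Return the granularity level with the highest discriminative power."""
    return max(levels, key=lambda k: discriminative_power(levels[k]))


# Hypothetical per-category mean quality scores for one dimension,
# keyed by the number of categories at each granularity level.
levels = {
    2: {"c1": 0.60, "c2": 0.50},
    4: {"c1": 0.62, "c2": 0.58, "c3": 0.52, "c4": 0.48},
    8: {"c1": 0.58, "c2": 0.57, "c3": 0.56, "c4": 0.55,
        "c5": 0.55, "c6": 0.54, "c7": 0.53, "c8": 0.52},
}

for k, scores in levels.items():
    print(k, round(discriminative_power(scores), 4))

print("optimal granularity:", optimal_granularity(levels))
```

With these made-up scores, the medium level spreads quality across categories the most, so it would be selected. The same procedure would be rerun per dimension and per RAG configuration, since the abstract stresses that the optimum depends on both.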

Eval Frameworks & Benchmarks · Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 23

Year: 2026

Venue: N/A
