Search papers, labs, and topics across Lattice.
The paper introduces HybridRAG-Bench, a framework for evaluating retrieval-augmented language models on multi-hop reasoning tasks over hybrid knowledge sources (text and knowledge graphs). It addresses the problem of benchmark contamination by constructing question-answer pairs grounded in recent scientific literature from arXiv, ensuring that the required knowledge is unlikely to be memorized by the LLM. Experiments demonstrate that HybridRAG-Bench effectively distinguishes genuine retrieval and reasoning from parametric recall across diverse domains like AI, governance, and bioinformatics.
HybridRAG-Bench reveals that existing benchmarks overestimate the reasoning abilities of retrieval-augmented LLMs due to contamination, offering a more realistic evaluation using up-to-date scientific knowledge.
Large language models (LLMs) continue to struggle with knowledge-intensive questions that require up-to-date information and multi-hop reasoning. Augmenting LLMs with hybrid external knowledge, such as unstructured text and structured knowledge graphs, offers a promising alternative to costly continual pretraining. As such, reliable evaluation of their retrieval and reasoning capabilities becomes critical. However, many existing benchmarks increasingly overlap with LLM pretraining data, which means answers or supporting knowledge may already be encoded in model parameters, making it difficult to distinguish genuine retrieval and reasoning from parametric recall. We introduce HybridRAG-Bench, a framework for constructing benchmarks to evaluate retrieval-intensive, multi-hop reasoning over hybrid knowledge. HybridRAG-Bench automatically couples unstructured text and structured knowledge graph representations derived from recent scientific literature on arXiv, and generates knowledge-intensive question-answer pairs grounded in explicit reasoning paths. The framework supports flexible domain and time-frame selection, enabling contamination-aware and customizable evaluation as models and knowledge evolve. Experiments across three domains (artificial intelligence, governance and policy, and bioinformatics) demonstrate that HybridRAG-Bench rewards genuine retrieval and reasoning rather than parametric recall, offering a principled testbed for evaluating hybrid knowledge-augmented reasoning systems. We release our code and data at github.com/junhongmit/HybridRAG-Bench.