The paper introduces SURE-RAG, a method for verifying whether retrieved evidence is sufficient in retrieval-augmented generation (RAG) systems: given a question and a candidate answer, it predicts whether the evidence supports the answer, refutes it, or is insufficient to decide. SURE-RAG aggregates pair-level claim-evidence verifications into answer-level signals such as coverage, relation strength, and disagreement, which enables selective answering and auditability. On HotpotQA-RAG v3, calibrated SURE-RAG reaches 0.9075 Macro-F1, above DeBERTa mean-pooling (0.6516) and a GPT-4o judge (0.7284) and on par with an opaque concat cross-encoder (0.8888 +/- 0.0109), and it lowers risk at 30% coverage from 0.2588 to 0.1642.
On HotpotQA-RAG v3, SURE-RAG cuts unsafe answers by 37% at 30% coverage. It is a transparent evidence verification method that outperforms a GPT-4o judge on controlled sufficiency verification, though GPT-4o is ahead on HaluBench unsafe detection.
Retrieval-augmented generation (RAG) grounds answers in retrieved passages, but retrieval is not verification: a passage can be topical and still fail to justify the answer. We frame this gap as evidence sufficiency verification for selective RAG answering: given a question, a candidate answer, and retrieved evidence, predict whether the evidence supports, refutes, or is insufficient, and abstain unless support is established. We present SURE-RAG, a transparent aggregation protocol built on the observation that evidence sufficiency is a set-level property: missing hops and unresolved conflicts cannot be detected by independent passage scoring. A shared pair-level claim-evidence verifier produces local relation distributions, which SURE-RAG aggregates into interpretable answer-level signals -- coverage, relation strength, disagreement, conflict, and retrieval uncertainty -- yielding a three-way decision and an auditable selective score. We evaluate on HotpotQA-RAG v3, a controlled multi-hop benchmark, under an artifact-aware protocol (shortcut baselines, counterfactual swaps, no-oracle checks, GPT-4o audits). Calibrated SURE-RAG reaches 0.9075 Macro-F1 (0.8951 +/- 0.0069), substantially above DeBERTa mean-pooling (0.6516) and a GPT-4o judge (0.7284), while matching a strong but opaque concat cross-encoder (0.8888 +/- 0.0109) with full auditability. Risk at 30% coverage drops from 0.2588 to 0.1642, a 37% reduction in unsafe answers. To deliberately probe the task boundary, we further contrast SURE-RAG with GPT-4o on HaluBench unsafe detection: the ranking reverses (0.3343 vs 0.7389 unsafe-F1), establishing that controlled sufficiency verification and natural hallucination detection are distinct problems.
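To make the set-level argument concrete, here is a minimal Python sketch of how pair-level relation distributions might be aggregated into answer-level signals, a three-way decision, and a selective score. It is an illustration under stated assumptions, not the paper's implementation: the names `Signals`, `aggregate`, `decide`, `selective_score`, and `risk_at_coverage`, the thresholds `tau` and `conflict_max`, the signal formulas, and the example probabilities are all hypothetical. Only three of the five signals named in the abstract are sketched; disagreement and retrieval uncertainty are omitted because the abstract does not define them.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

# One pair-level verifier output: (p_support, p_refute, p_insufficient).
Dist = Tuple[float, float, float]


@dataclass
class Signals:
    coverage: float  # share of claims backed by at least one passage
    strength: float  # mean of each claim's best support probability
    conflict: float  # largest refute probability over all pairs


def aggregate(pair_probs: Sequence[Sequence[Dist]], tau: float = 0.5) -> Signals:
    """pair_probs[i][j] scores claim i against passage j (assumed layout)."""
    best_support = [max(p[0] for p in row) for row in pair_probs]
    n_claims = len(best_support)
    return Signals(
        coverage=sum(s >= tau for s in best_support) / n_claims,
        strength=sum(best_support) / n_claims,
        conflict=max(p[1] for row in pair_probs for p in row),
    )


def decide(sig: Signals, conflict_max: float = 0.5) -> str:
    """Three-way decision; the thresholds are illustrative, not tuned values."""
    if sig.conflict >= conflict_max:
        return "refutes"
    if sig.coverage == 1.0:
        return "supports"
    return "insufficient"


def selective_score(sig: Signals) -> float:
    """Higher means safer to answer; each factor can be audited separately."""
    return sig.coverage * sig.strength * (1.0 - sig.conflict)


def risk_at_coverage(scores: Sequence[float], safe: Sequence[bool],
                     coverage: float) -> float:
    """Unsafe rate among the top `coverage` fraction of examples by score."""
    k = max(1, round(coverage * len(scores)))
    ranked = sorted(zip(scores, safe), key=lambda pair: pair[0], reverse=True)
    return sum(not ok for _, ok in ranked[:k]) / k


# Two-hop answer, two passages: each hop is supported by a different passage.
full = [[(0.90, 0.05, 0.05), (0.10, 0.05, 0.85)],
        [(0.10, 0.10, 0.80), (0.80, 0.10, 0.10)]]
# Same question, but the second passage is swapped for a topical distractor.
missing_hop = [[(0.90, 0.05, 0.05), (0.10, 0.05, 0.85)],
               [(0.10, 0.10, 0.80), (0.20, 0.10, 0.70)]]

print(decide(aggregate(full)), selective_score(aggregate(full)))
print(decide(aggregate(missing_hop)), selective_score(aggregate(missing_hop)))
```

In the second case every passage is on topic and the first hop is well supported, yet the second hop has no supporting passage, so coverage falls below 1 and the sketch abstains. Scoring each passage independently would not reveal that gap, which is the sense in which sufficiency is a set-level property. The reported 37% figure is the relative drop in risk at 30% coverage: (0.2588 - 0.1642) / 0.2588 ≈ 0.366.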