CUHK, Genesys, the Research Ireland Centre for Software
Mar 29, 2026
arXiv:2603.27752

Retromorphic Testing with Hierarchical Verification for Hallucination Detection in RAG

Boxi Yu, Yuzhong Zhang, Liting Lin, Lionel Briand, Emir Muñoz

AI Summary

RT4CHART is introduced as a retromorphic testing framework that decomposes LLM outputs in RAG into verifiable claims and performs hierarchical verification against retrieved context to detect hallucinations. This approach assigns entailment, contradiction, or baseless labels to each claim and maps these decisions back to specific answer spans, providing fine-grained auditing. Evaluated on RAGTruth++ and a re-annotated RAGTruth-Enhance, RT4CHART significantly outperforms existing baselines in hallucination detection, achieving an F1 score of 0.776 on RAGTruth++ and 47.5% span-level F1 on RAGTruth-Enhance.

Key Contribution

Hallucinations in RAG are more prevalent than existing benchmarks indicate: re-annotation reveals 1.68x more hallucination cases than the original labels. A new framework, RT4CHART, achieves the best answer-level hallucination detection F1 among all compared baselines.

Abstract

Large language models (LLMs) continue to hallucinate in retrieval-augmented generation (RAG), producing claims that are unsupported by or conflict with the retrieved context. Detecting such errors remains challenging when faithfulness is evaluated solely with respect to the retrieved context. Existing approaches either provide coarse-grained, answer-level scores or focus on open-domain factuality, often lacking fine-grained, evidence-grounded diagnostics. We present RT4CHART, a retromorphic testing framework for context-faithfulness assessment. RT4CHART decomposes model outputs into independently verifiable claims and performs hierarchical, local-to-global verification against the retrieved context. Each claim is assigned one of three labels: entailed, contradicted, or baseless. Furthermore, RT4CHART maps claim-level decisions back to specific answer spans and retrieves explicit supporting or refuting evidence from the context, enabling fine-grained and interpretable auditing. We evaluate RT4CHART on RAGTruth++ (408 samples) and RAGTruth-Enhance (2,675 samples), a newly re-annotated benchmark. RT4CHART achieves the best answer-level hallucination detection F1 among all baselines. On RAGTruth++, it reaches an F1 score of 0.776, outperforming the strongest baseline by 83%. On RAGTruth-Enhance, it achieves a span-level F1 of 47.5%. Ablation studies show that the hierarchical verification design is the primary driver of performance gains. Finally, our re-annotation reveals 1.68x more hallucination cases than the original labels, suggesting that existing benchmarks substantially underestimate the prevalence of hallucinations.
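The pipeline described in the abstract (claim decomposition, local-to-global verification, three-way labeling, and mapping back to answer spans) can be illustrated with a minimal sketch. This is not the authors' implementation. Every name here (`Claim`, `decompose`, `toy_verify`, `verify_hierarchically`, `audit`) is hypothetical, sentence splitting stands in for real claim decomposition, and `toy_verify` is a string-matching placeholder for whatever entailment model a real system would use. Reading "local-to-global" as "check each retrieved chunk first, then the full context" is also an assumption.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Tuple


class Label(Enum):
    ENTAILED = "entailed"
    CONTRADICTED = "contradicted"
    BASELESS = "baseless"


@dataclass
class Claim:
    text: str
    span: Tuple[int, int]  # character offsets of the claim in the answer


def decompose(answer: str) -> List[Claim]:
    """Assumption: one claim per sentence; a real system would split finer."""
    claims = []
    for m in re.finditer(r"[^.!?]+[.!?]?", answer):
        raw = m.group()
        text = raw.strip()
        if text:
            start = m.start() + (len(raw) - len(raw.lstrip()))
            claims.append(Claim(text, (start, start + len(text))))
    return claims


def toy_verify(claim: str, evidence: str) -> Label:
    """Placeholder verifier (hypothetical): substring match means entailed,
    a matching ' is not ' form means contradicted, anything else is baseless."""
    c, e = claim.lower().rstrip(".!?"), evidence.lower()
    if c in e:
        return Label.ENTAILED
    if c.replace(" is ", " is not ") in e:
        return Label.CONTRADICTED
    return Label.BASELESS


def verify_hierarchically(
    claim: Claim, chunks: List[str], verify: Callable[[str, str], Label]
) -> Tuple[Label, str]:
    # Local pass: test the claim against each retrieved chunk on its own.
    for chunk in chunks:
        label = verify(claim.text, chunk)
        if label is not Label.BASELESS:
            return label, chunk
    # Global pass: fall back to the concatenated context.
    full = " ".join(chunks)
    label = verify(claim.text, full)
    return label, (full if label is not Label.BASELESS else "")


def audit(answer: str, chunks: List[str], verify=toy_verify):
    """Return an answer-level flag plus per-claim (claim, label, evidence)."""
    results = [
        (claim, *verify_hierarchically(claim, chunks, verify))
        for claim in decompose(answer)
    ]
    hallucinated = any(label is not Label.ENTAILED for _, label, _ in results)
    return hallucinated, results


if __name__ == "__main__":
    context = ["The sky is blue.", "Paris is not the capital of Spain."]
    answer = "The sky is blue. Paris is the capital of Spain. Cats can fly."
    flagged, details = audit(answer, context)
    for claim, label, evidence in details:
        print(claim.span, label.value, "|", evidence or "(no evidence)")
```

Because each `Claim` carries its character offsets, any non-entailed label can be traced back to the exact answer span, and the evidence string returned alongside the label records which part of the context supported or refuted it.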

Topics: Eval Frameworks & Benchmarks; Recommendation & Information Retrieval; Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
