Aligarh Muslim University

Interdisciplinary Center for Artificial Intelligence

Z.H. College of Engineering & Technology

Mar 18, 2026

arXiv:2603.17872

Mitigating LLM Hallucinations through Domain-Grounded Tiered Retrieval

Md. Asraful Haque, Aasar Mehdi, Maaz Mahboob, Tamkeen Fatima

AI Summary

This paper introduces a domain-grounded tiered retrieval and verification pipeline, implemented using LangGraph, to mitigate hallucinations in LLMs. The pipeline incorporates intrinsic verification, adaptive search routing with a domain detector, corrective document grading (CRAG), and extrinsic regeneration with claim-level verification. Experiments across five benchmarks show the pipeline consistently outperforms zero-shot baselines, achieving win rates up to 83.7% and groundedness scores between 78.8% and 86.4%, while also identifying a "False-Premise Overclaiming" failure mode.

Key Contribution

LLMs can be systematically shifted from stochastic pattern-matchers to verified truth-seekers using a carefully orchestrated, multi-stage retrieval and verification pipeline.

Abstract

Large Language Models (LLMs) have achieved unprecedented fluency but remain susceptible to "hallucinations": the generation of factually incorrect or ungrounded content. This limitation is particularly critical in high-stakes domains where reliability is paramount. We propose a domain-grounded tiered retrieval and verification architecture designed to systematically intercept factual inaccuracies by shifting LLMs from stochastic pattern-matchers to verified truth-seekers. The proposed framework utilizes a four-phase, self-regulating pipeline implemented via LangGraph: (I) Intrinsic Verification with Early-Exit logic to optimize compute, (II) Adaptive Search Routing utilizing a Domain Detector to target subject-specific archives, (III) Corrective Document Grading (CRAG) to filter irrelevant context, and (IV) Extrinsic Regeneration followed by atomic claim-level verification. The system was evaluated across 650 queries from five diverse benchmarks: TimeQA v2, FreshQA v2, HaluEval General, MMLU Global Facts, and TruthfulQA. Empirical results demonstrate that the pipeline consistently outperforms zero-shot baselines across all environments. Win rates peaked at 83.7% in TimeQA v2 and 78.0% in MMLU Global Facts, confirming high efficacy in domains requiring granular temporal and numerical precision. Groundedness scores remained robustly stable between 78.8% and 86.4% across factual-answer rows. While the architecture provides a robust fail-safe for misinformation, a persistent failure mode of "False-Premise Overclaiming" was identified. These findings provide a detailed empirical characterization of multi-stage RAG behavior and suggest that future work should prioritize pre-retrieval "answerability" nodes to further bridge the reliability gap in conversational AI.
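The control flow of the four phases named in the abstract can be illustrated with a minimal plain-Python sketch. This is not the authors' implementation and it does not use LangGraph. Every name here (`PipelineState`, `run_pipeline`, and the callables passed in) is hypothetical, and each phase is reduced to a caller-supplied function so that only the ordering and the early-exit branch are shown. The toy stubs at the bottom stand in for an LLM, a domain detector, a search backend, and a grader.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PipelineState:
    """Hypothetical state object passed between phases."""
    query: str
    draft: str = ""
    domain: str = ""
    documents: List[str] = field(default_factory=list)
    answer: str = ""
    unsupported_claims: List[str] = field(default_factory=list)
    trace: List[str] = field(default_factory=list)


def run_pipeline(
    query: str,
    generate: Callable[[str], str],
    is_confident: Callable[[str, str], bool],
    detect_domain: Callable[[str], str],
    search: Callable[[str, str], List[str]],
    is_relevant: Callable[[str, str], bool],
    regenerate: Callable[[str, List[str]], str],
    split_claims: Callable[[str], List[str]],
    is_supported: Callable[[str, List[str]], bool],
) -> PipelineState:
    state = PipelineState(query=query)

    # Phase I: intrinsic verification with early exit.
    state.draft = generate(query)
    state.trace.append("intrinsic_verification")
    if is_confident(query, state.draft):
        state.answer = state.draft
        state.trace.append("early_exit")
        return state

    # Phase II: adaptive search routing via a domain detector.
    state.domain = detect_domain(query)
    candidates = search(query, state.domain)
    state.trace.append("adaptive_search")

    # Phase III: document grading (a simplified stand-in for CRAG).
    state.documents = [d for d in candidates if is_relevant(query, d)]
    state.trace.append("document_grading")

    # Phase IV: regeneration from retrieved context, then claim-level checks.
    state.answer = regenerate(query, state.documents)
    state.unsupported_claims = [
        claim for claim in split_claims(state.answer)
        if not is_supported(claim, state.documents)
    ]
    state.trace.append("claim_verification")
    return state


# Toy stubs (assumptions for illustration only, not real models or archives).
archive = {"history": ["The Berlin Wall fell in 1989.", "Bananas are yellow."]}

result = run_pipeline(
    "When did the Berlin Wall fall?",
    generate=lambda q: "It fell in 1991.",
    is_confident=lambda q, a: False,
    detect_domain=lambda q: "history",
    search=lambda q, dom: archive.get(dom, []),
    is_relevant=lambda q, d: "Berlin" in d,
    regenerate=lambda q, docs: " ".join(docs),
    split_claims=lambda a: [s.strip() + "." for s in a.split(".") if s.strip()],
    is_supported=lambda c, docs: any(c in d for d in docs),
)
print(result.answer)
print(result.trace)
```

Passing the phases in as callables keeps the sketch independent of any particular model, retriever, or orchestration library. In a graph-based framework, each phase would presumably become a node and the early-exit check a conditional edge.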

Eval Frameworks & Benchmarks, Natural Language Processing, Recommendation & Information Retrieval, Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 40

Year: 2026

Venue: N/A
