Mar 9, 2026arXiv:2603.08274

How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms

AI Summary

The paper introduces RIKER, a ground-truth-first evaluation methodology, to measure hallucination rates in 35 open-weight LLMs across varying context lengths, temperatures, and hardware platforms, totaling 172 billion tokens of evaluation. Key findings include a non-trivial fabrication rate even in top models (1.19% at 32K context), a steep increase in fabrication with longer contexts, and the dominance of model selection over other factors like temperature in determining accuracy. The study also reveals that grounding ability and fabrication resistance are distinct capabilities, and that results are consistent across hardware platforms.

Key Contribution

LLMs hallucinate far more than you think in document Q&A, with fabrication rates tripling as context grows from 32K to 128K tokens, and model selection matters more than hyperparameter tuning or hardware.

Abstract

How much do large language models actually hallucinate when answering questions grounded in provided documents? Despite the critical importance of this question for enterprise AI deployments, reliable measurement has been hampered by benchmarks that rely on static datasets vulnerable to contamination, LLM-based judges with documented biases, or evaluation scales too small for statistical confidence. We address this gap using RIKER, a ground-truth-first evaluation methodology that enables deterministic scoring without human annotation. Across 35 open-weight models, three context lengths (32K, 128K, and 200K tokens), four temperature settings, and three hardware platforms (NVIDIA H200, AMD MI300X, and Intel Gaudi 3), we conducted over 172 billion tokens of evaluation - an order of magnitude beyond prior work. Our findings reveal that: (1) even the best-performing models fabricate answers at a non-trivial rate - 1.19% at best at 32K, with top-tier models at 5 - 7% - and fabrication rises steeply with context length, nearly tripling at 128K and exceeding 10% for all models at 200K; (2) model selection dominates all other factors, with overall accuracy spanning a 72-percentage-point range and model family predicting fabrication resistance better than model size; (3) temperature effects are nuanced - T=0.0 yields the best overall accuracy in roughly 60% of cases, but higher temperatures reduce fabrication for the majority of models and dramatically reduce coherence loss (infinite generation loops), which can reach 48x higher rates at T=0.0 versus T=1.0; (4) grounding ability and fabrication resistance are distinct capabilities - models that excel at finding facts may still fabricate facts that do not exist; and (5) results are consistent across hardware platforms, confirming that deployment decisions need not be hardware-dependent.

Distributed Systems & Hardware Eval Frameworks & Benchmarks Natural Language Processing

Citation Metrics

Citations0

Influential citations0

References14

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms

Related Papers