Utrecht University | May 6, 2026 | arXiv:2605.05025

Detecting Hallucinations in Large Language Models via Internal Attention Divergence Signals

AI Summary

This paper introduces a single-pass uncertainty quantification method that detects LLM hallucinations from attention divergence. The method measures the KL divergence between each attention head's distribution and a uniform distribution, then trains a logistic regression probe on these per-head features to predict answer correctness. Experiments across datasets, tasks, and model families show that attention divergence is predictive of correctness and competitive with existing uncertainty estimation methods, with the signal concentrated in middle layers and on factual tokens.

Key Contribution

Per-head attention divergence provides a lightweight, white-box signal for detecting LLM hallucinations, avoiding the cost of repeated sampling or external models.

Abstract

We propose a lightweight, single-pass uncertainty quantification method for detecting hallucinations in Large Language Models. The method uses attention matrices to estimate uncertainty without requiring repeated sampling or external models. Specifically, we measure the Kullback-Leibler divergence between each attention head's distribution and a uniform reference distribution, and use these features in a logistic regression probe. Across multiple datasets, task types, and model families, attention divergence is highly predictive of answer correctness and performs competitively with existing uncertainty estimation methods. We find that this signal is concentrated in middle layers and on factual tokens such as named entities and numbers, suggesting that attention dynamics provide an efficient and interpretable white-box signal of model uncertainty.
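
The feature described above can be illustrated with a minimal sketch. For an attention row p over n keys, the divergence from the uniform distribution U is KL(p || U) = log n - H(p), where H is the Shannon entropy, so a sharply focused head scores close to log n and a diffuse head scores close to 0. The code below is not the paper's implementation: the function names, the choice to average each head's divergence over query positions, the assumption that every row is a full distribution over all n keys (no causal masking), and the plain gradient-descent logistic regression are all assumptions made for illustration.

```python
import numpy as np


def kl_to_uniform(attn_row, eps=1e-12):
    """KL(p || U) for one attention row p over n keys (equals log n - H(p))."""
    p = np.asarray(attn_row, dtype=float)
    p = p / p.sum()  # renormalize in case of rounding error
    n = p.shape[-1]
    # eps keeps log finite for zero entries; those terms contribute 0 * log(eps) = 0
    return float(np.sum(p * (np.log(p + eps) + np.log(n))))


def head_features(attn, eps=1e-12):
    """Map attention of shape (layers, heads, queries, keys) to one feature per head.

    Assumption: each row sums to 1 over all keys, and rows are averaged over
    query positions. The paper may aggregate differently.
    """
    attn = np.asarray(attn, dtype=float)
    n = attn.shape[-1]
    kl = np.sum(attn * (np.log(attn + eps) + np.log(n)), axis=-1)  # (L, H, Q)
    return kl.mean(axis=-1).reshape(-1)  # (L * H,)


def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic_probe(X, y, lr=0.1, steps=500, l2=1e-3):
    """Fit a logistic regression probe with batch gradient descent.

    Hypothetical stand-in for an off-the-shelf solver; hyperparameters are
    illustrative, not taken from the paper.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = _sigmoid(X @ w + b)
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
        b -= lr * float(np.mean(p - y))
    return w, b


def predict_proba(X, w, b):
    """Predicted probability that each answer is correct."""
    return _sigmoid(np.asarray(X, dtype=float) @ w + b)
```

In use, one would stack `head_features(attn)` for each generated answer into a matrix `X`, pair it with binary correctness labels `y`, call `fit_logistic_probe(X, y)`, and score new answers with `predict_proba`. With causal attention, each query row only sees a prefix of the sequence, so the uniform reference would need to be taken over the visible positions rather than over all n keys.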

Eval Frameworks & Benchmarks | Interpretability & Mechanistic Interp | Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
