This paper investigates citation hallucination in LLMs, finding that author names are the most frequent source of error. The authors show that signals for hallucinating different citation fields do not generalize across fields, and identify specific "FH-neurons" in Qwen2.5-32B-Instruct associated with field-specific hallucination, using neuron-level CETT values and elastic-net regularization. Causal interventions targeting these neurons demonstrate that suppressing them reduces hallucination across citation fields.
LLMs have "hallucination neurons" for specific citation fields, and silencing them reduces fabrication.
LLMs frequently generate fictitious yet convincing citations, often expressing high confidence even when the underlying reference is wrong. We study this failure across 9 models and 108,000 generated references, and find that author names are hallucinated far more often than any other field, across all models and settings. Citation style has no measurable effect on hallucination rates, while reasoning-oriented distillation degrades recall. Probes trained on one field transfer at near-chance levels to the others, suggesting that hallucination signals do not generalize across fields. Building on this finding, we apply elastic-net regularization with stability selection to neuron-level CETT values in Qwen2.5-32B-Instruct and identify a sparse set of field-specific hallucination neurons (FH-neurons). Causal intervention confirms their role: amplifying these neurons increases hallucination, while suppressing them improves performance across fields, with larger gains in some fields than others. These results suggest a lightweight approach to detecting and mitigating citation hallucination using internal model signals alone.
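To make the selection step concrete, here is a minimal sketch of elastic-net regularization with stability selection over neuron-level CETT features. Everything in it is an illustrative assumption rather than the paper's configuration: the data is synthetic, and the feature matrix `X` (one CETT value per neuron per example), the labels `y` (1 if the citation field was hallucinated), the subsample count, and the 0.8 selection threshold are all placeholders.

```python
# Sketch: elastic-net + stability selection over per-neuron CETT features.
# X: (n_examples, n_neurons) hypothetical CETT values; y: hallucination labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_examples, n_neurons = 2000, 512            # toy sizes, not the paper's
X = rng.normal(size=(n_examples, n_neurons))
# Synthetic labels driven by the first 5 "neurons" so selection has a target.
y = (X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=n_examples) > 0).astype(int)

n_runs, keep = 50, np.zeros(n_neurons)
for _ in range(n_runs):
    # Refit an elastic-net classifier on a random half of the data.
    idx = rng.choice(n_examples, size=n_examples // 2, replace=False)
    clf = LogisticRegression(
        penalty="elasticnet", solver="saga",
        l1_ratio=0.5, C=0.1, max_iter=2000,
    ).fit(X[idx], y[idx])
    keep += (np.abs(clf.coef_[0]) > 1e-6)    # neuron survived this subsample

# Keep only neurons selected in at least 80% of subsampled fits.
stable = np.where(keep / n_runs >= 0.8)[0]
print("candidate FH-neurons:", stable)
```

Requiring a neuron's coefficient to survive many subsampled fits, rather than a single fit, is what makes the resulting FH-neuron set sparse and reproducible.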
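The intervention side can be sketched as a forward hook that rescales selected MLP activations during generation, assuming FH-neurons correspond to units of one block's gated MLP activation (the `mlp.act_fn` module layout of Qwen2 models in `transformers`). The layer index, neuron indices, and scaling factor below are hypothetical; the paper's exact intervention may differ.

```python
# Sketch: suppress (scale=0.0) or amplify (scale>1.0) selected MLP neurons.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-32B-Instruct"     # a smaller Qwen2.5 also works for testing
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")

fh_neurons = torch.tensor([13, 472, 901])    # hypothetical FH-neuron indices
scale = 0.0                                  # 0.0 suppresses; >1.0 amplifies

def intervene(module, inputs, output):
    # Rescale the chosen hidden units of this MLP's activation output.
    output[..., fh_neurons] *= scale
    return output

# Hook the activation of one transformer block's MLP (assumed module path).
handle = model.model.layers[20].mlp.act_fn.register_forward_hook(intervene)

prompt = "Cite one paper on citation hallucination in LLMs."
out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()                              # restore the unmodified model
```

Running the same prompt with `scale = 0.0` versus `scale > 1.0` and comparing hallucination rates is the shape of the causal test described in the abstract: suppression should reduce fabricated fields, amplification should increase them.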