The paper introduces HalluLens, a benchmark designed to evaluate and analyze hallucinations in large language models (LLMs), built on a clear taxonomy that distinguishes extrinsic hallucinations (content inconsistent with the training data) from intrinsic hallucinations (content inconsistent with the input prompt). It addresses the limitations of existing benchmarks by incorporating dynamically generated test sets to mitigate data leakage and saturation, and it focuses on extrinsic hallucinations, which become increasingly relevant as LLMs evolve. The benchmark provides a unified framework for evaluating hallucinations, promoting consistency and facilitating research in this area.
HalluLens offers a fresh taxonomy and dynamic benchmarks to finally disentangle and measure the slippery problem of LLM hallucinations, going beyond just "factuality".
Large language models (LLMs) often generate responses that deviate from user input or training data, a phenomenon known as "hallucination." These hallucinations undermine user trust and hinder the adoption of generative AI systems, so addressing them is essential for the advancement of LLMs. This paper introduces a comprehensive hallucination benchmark, incorporating both new extrinsic and existing intrinsic evaluation tasks, built upon a clear taxonomy of hallucination. A major challenge in benchmarking hallucinations is the lack of a unified framework, owing to inconsistent definitions and categorizations. We disentangle LLM hallucination from "factuality," proposing a clear taxonomy that distinguishes between extrinsic and intrinsic hallucinations to promote consistency and facilitate research. Extrinsic hallucinations, where the generated content is not consistent with the training data, are increasingly important as LLMs evolve. Our benchmark includes dynamic test set generation to mitigate data leakage and ensure robustness against it. We also analyze existing benchmarks, highlighting their limitations and saturation. The work aims to: (1) establish a clear taxonomy of hallucinations; (2) introduce new extrinsic hallucination tasks, with data that can be dynamically regenerated to prevent saturation through leakage; and (3) provide a comprehensive analysis of existing benchmarks, distinguishing them from factuality evaluations.
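The idea behind dynamic test set generation can be illustrated with a minimal sketch. This is not HalluLens's actual pipeline; the fact pool, field names, and sampling scheme here are all hypothetical. The point is only the mechanism: because prompts are sampled fresh from a (regularly refreshed) source each run, a fixed test split can never leak into training data and saturate.

```python
import random

# Hypothetical fact pool; a real benchmark would draw from a large,
# frequently refreshed corpus rather than a hard-coded list.
FACT_POOL = [
    {"subject": "Marie Curie", "field": "physics"},
    {"subject": "Alan Turing", "field": "computer science"},
    {"subject": "Rosalind Franklin", "field": "chemistry"},
    {"subject": "Katherine Johnson", "field": "mathematics"},
]

def generate_test_set(seed: int, k: int = 2) -> list[dict]:
    """Sample a fresh subset of facts and render them as QA prompts.

    Regenerating with a new seed yields a different test set, so a model
    cannot saturate the benchmark by memorizing a fixed, leaked split.
    """
    rng = random.Random(seed)
    sampled = rng.sample(FACT_POOL, k)
    return [
        {
            "prompt": f"What was {fact['subject']}'s primary field?",
            "reference": fact["field"],
        }
        for fact in sampled
    ]

# Two seeds produce two (generally different) test sets; each individual
# set is still reproducible from its seed for scoring and auditing.
set_a = generate_test_set(seed=1)
set_b = generate_test_set(seed=2)
```

Seeding keeps each generated set reproducible for evaluation, while rotating seeds (or refreshing the fact pool) keeps the benchmark itself moving faster than any model's training data can absorb it.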