Jun 4, 2026 | arXiv:2606.06286

LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs

Gianluca Barmina, Peter Schneider-Kamp, Lukas Galke Poech

AI Summary

This paper introduces PropMe, a framework for evaluating memorization in large language models (LLMs) that distinguishes memorization forced through capability attacks from memorization that arises under ordinary use. Using a lightweight tracing pipeline called SimpleTrace, the authors analyze two models, Comma and DFM Decoder, and find a consistent gap between the models' capability to reproduce training data under adversarial prompting and their propensity to do so in typical interactions. The findings indicate that while LLMs can leak training data when directly elicited, they exhibit low memorization propensity in non-adversarial contexts, which motivates memorization audits that report both worst-case and ordinary leakage.

Key Contribution

LLMs can leak training data when directly elicited, but rarely do so under ordinary, non-adversarial prompting, so memorization audits should report both capability and propensity.

Abstract

Large language models can reproduce training data, but existing memorization evaluations mostly measure whether models can be forced to do so, rather than whether they do so under ordinary use. We introduce PropMe, a propensity-aware framework for memorization evaluation that contrasts prefix-based capability attacks with non-adversarial evaluations. We propose a metric transformation that, applied to existing metric functions, turns them into propensity metrics. We further introduce SimpleTrace, a lightweight tracing pipeline built on infini-gram that deterministically attributes model generations to large-scale training corpora and computes verbatim, near-verbatim, and propensity-transformed memorization metrics. Evaluating two fully open models (Comma and DFM Decoder) on two datasets (Common Pile and Dynaword) in two languages, we find a consistent gap between capability and propensity: prefix attacks elicit substantially stronger memorization signals than generic or dataset-specific prompts, while propensity scores remain low overall. Thus, the models can reveal training data when directly elicited, but rarely do so in more common non-adversarial settings. We also find that DFM Decoder, which is continually pre-trained from Comma, exhibits reduced memorization and memorization propensity for Common Pile, confirming that memorization capability can decrease when later training emphasizes partially different data. Our results suggest that memorization audits should report both worst-case extractability and ordinary leakage propensity, and we encourage this practice as a way to obtain a more comprehensive view of the phenomenon.
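
The contrast between capability and propensity can be illustrated with a toy sketch. This is not SimpleTrace and does not use infini-gram: it assumes a tiny in-memory corpus, whitespace tokens, and a hypothetical `verbatim_rate` helper that reports the fraction of generations containing at least one n-token span copied from the corpus. The paper's actual metrics and its propensity transformation are not specified on this page, so the function names, the choice of n, and the example data are all assumptions made for illustration.

```python
from typing import Iterable, Sequence, Set, Tuple


def ngram_set(corpus_docs: Iterable[Sequence[str]], n: int) -> Set[Tuple[str, ...]]:
    """Collect every contiguous n-gram of tokens from a toy corpus."""
    grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        for i in range(len(doc) - n + 1):
            grams.add(tuple(doc[i:i + n]))
    return grams


def has_verbatim_span(generation: Sequence[str], grams: Set[Tuple[str, ...]], n: int) -> bool:
    """True if any n-token window of the generation occurs verbatim in the corpus."""
    return any(
        tuple(generation[i:i + n]) in grams
        for i in range(len(generation) - n + 1)
    )


def verbatim_rate(generations, corpus_docs, n: int) -> float:
    """Fraction of generations with at least one verbatim n-gram match (hypothetical metric)."""
    gens = list(generations)
    if not gens:
        return 0.0
    grams = ngram_set(corpus_docs, n)
    hits = sum(has_verbatim_span(g, grams, n) for g in gens)
    return hits / len(gens)


# Toy data (invented for illustration, not from the paper).
corpus = ["the quick brown fox jumps over the lazy dog".split()]

# Generations obtained by prompting with a training-data prefix (capability setting).
prefix_prompted = [
    "brown fox jumps over the lazy dog".split(),
    "fox jumps over the fence".split(),
]

# Generations obtained from generic prompts (ordinary-use setting).
generic_prompted = [
    "a dog sat on the mat".split(),
    "the quick red fox".split(),
]

capability_rate = verbatim_rate(prefix_prompted, corpus, n=4)
ordinary_rate = verbatim_rate(generic_prompted, corpus, n=4)
print(capability_rate, ordinary_rate)
```

In this toy setup the prefix-prompted generations all contain a copied 4-gram while the generic ones contain none, which mirrors the direction of the gap the abstract reports. A real audit would replace the in-memory set with an index over the full training corpus.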

Eval Frameworks & Benchmarks | Natural Language Processing | Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 49

Year: 2026

Venue: N/A
