UW, Cornell | Feb 22, 2026 | arXiv:2602.19020

Learning to Detect Language Model Training Data via Active Reconstruction

Junjie Oscar Yin, John X. Morris, Vitaly Shmatikov, Sewon Min, Hannaneh Hajishirzi

AI Summary

The paper introduces the Active Data Reconstruction Attack (ADRA), a membership inference attack (MIA) that actively trains a language model to reconstruct a given text, on the hypothesis that training data are more reconstructible than non-member data. ADRA uses on-policy reinforcement learning (RL) to finetune a policy initialized from the target model, with reconstruction metrics and contrastive rewards designed to elicit data reconstruction. In the reported experiments, ADRA and its adaptive variant ADRA+ outperform existing MIAs at detecting pre-training, post-training, and distillation data, with an average improvement of 10.7% over the previous runner-up.

Key Contribution

Rather than passively analyzing the outputs of a fixed model, this attack actively trains the model to reconstruct specific texts and uses how readily each text is reconstructed as evidence that it was in the training data.

Abstract

Detecting LLM training data is generally framed as a membership inference attack (MIA) problem. However, conventional MIAs operate passively on fixed model weights, using log-likelihoods or text generations. In this work, we introduce Active Data Reconstruction Attack (ADRA), a family of MIAs that actively induces a model to reconstruct a given text through training. We hypothesize that training data are more reconstructible than non-members, and the difference in their reconstructibility can be exploited for membership inference. Motivated by findings that reinforcement learning (RL) sharpens behaviors already encoded in weights, we leverage on-policy RL to actively elicit data reconstruction by finetuning a policy initialized from the target model. To effectively use RL for MIA, we design reconstruction metrics and contrastive rewards. The resulting algorithms, ADRA and its adaptive variant ADRA+, improve both reconstruction and detection given a pool of candidate data. Experiments show that our methods consistently outperform existing MIAs in detecting pre-training, post-training, and distillation data, with an average improvement of 10.7% over the previous runner-up. In particular, ADRA+ improves over Min-K%++ by 18.8% on BookMIA for pre-training detection and by 7.6% on AIME for post-training detection.
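The scoring idea in the abstract, that a text which the model reconstructs more closely is more likely to be a training member, can be illustrated with a small sketch. This is not the paper's RL procedure: it covers only the final scoring step, and every specific choice here is an assumption made for illustration. The bigram-overlap metric, the best-of-samples aggregation, the AUROC evaluation, and the function names `ngram_overlap`, `reconstruction_score`, and `auroc` are hypothetical and do not come from the paper, which does not specify its metrics in this abstract.

```python
from collections import Counter


def ngram_overlap(candidate: str, reference: str, n: int = 2) -> float:
    """Fraction of the reference's n-grams (with multiplicity) found in the candidate.

    Assumption: whitespace tokenization and n-gram recall stand in for
    whatever reconstruction metric the attack actually uses.
    """
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, cand[gram]) for gram, count in ref.items())
    return matched / total


def reconstruction_score(generations: list[str], true_suffix: str, n: int = 2) -> float:
    """Membership score for one text: best overlap over sampled generations.

    Assumption: `generations` are continuations sampled from the (finetuned)
    policy given the text's prefix; here they are supplied by the caller.
    """
    return max((ngram_overlap(g, true_suffix, n) for g in generations), default=0.0)


def auroc(member_scores: list[float], nonmember_scores: list[float]) -> float:
    """Probability that a random member outscores a random non-member (ties count half)."""
    wins = 0.0
    for m in member_scores:
        for nm in nonmember_scores:
            if m > nm:
                wins += 1.0
            elif m == nm:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


if __name__ == "__main__":
    # Toy, hand-written generations; no model is queried in this sketch.
    suffix = "the cat sat on the mat"
    samples = ["the cat ran away", "the cat sat on a rug"]
    print(reconstruction_score(samples, suffix))
    print(auroc([0.6, 0.8], [0.2, 0.6]))
```

In a full attack, the generations would come from the policy after RL finetuning, and the scores of candidate texts would be compared across the candidate pool. The sketch only shows how a difference in reconstructibility turns into a detection score.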

Data Curation & Synthetic Data | Natural Language Processing | Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 49

Year: 2026

Venue: N/A
