The paper introduces the Active Data Reconstruction Attack (ADRA), a membership inference attack (MIA) that actively trains a language model to reconstruct a given text, on the hypothesis that training data are more reconstructible than non-members. ADRA uses on-policy reinforcement learning (RL) to finetune a policy initialized from the target model, with reconstruction metrics and contrastive rewards designed to elicit data reconstruction. Experiments show that ADRA and its adaptive variant ADRA+ consistently outperform existing MIAs in detecting pre-training, post-training, and distillation data, with an average improvement of 10.7% over the previous runner-up.
Forget passively analyzing model outputs – this new attack actively *trains* the model to regurgitate specific texts, revealing its training data with surprising accuracy.
Detecting LLM training data is generally framed as a membership inference attack (MIA) problem. However, conventional MIAs operate passively on fixed model weights, using log-likelihoods or text generations. In this work, we introduce Active Data Reconstruction Attack (ADRA), a family of MIAs that actively induces a model to reconstruct a given text through training. We hypothesize that training data are *more reconstructible* than non-members, and the difference in their reconstructibility can be exploited for membership inference. Motivated by findings that reinforcement learning (RL) sharpens behaviors already encoded in weights, we leverage on-policy RL to actively elicit data reconstruction by finetuning a policy initialized from the target model. To effectively use RL for MIA, we design reconstruction metrics and contrastive rewards. The resulting algorithms, ADRA and its adaptive variant ADRA+, improve both reconstruction and detection given a pool of candidate data. Experiments show that our methods consistently outperform existing MIAs in detecting pre-training, post-training, and distillation data, with an average improvement of 10.7% over the previous runner-up. In particular, ADRA+ improves over Min-K%++ by 18.8% on BookMIA for pre-training detection and by 7.6% on AIME for post-training detection.
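To make the idea of a reconstruction metric paired with a contrastive reward concrete, here is a minimal Python sketch. The abstract does not specify the actual metrics or reward, so everything below is an assumption for illustration: the n-gram overlap score, the "target score minus mean score on other candidates" reward, and the names `reconstruction_score` and `contrastive_reward` are all hypothetical, not the paper's definitions.

```python
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Multiset of n-grams in a token sequence (empty if the sequence is shorter than n)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def reconstruction_score(generated: str, target: str, n: int = 2) -> float:
    """Hypothetical metric: fraction of the target's n-grams recovered by the generation."""
    gen = ngrams(generated.split(), n)
    tgt = ngrams(target.split(), n)
    total = sum(tgt.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((gen & tgt).values())
    return overlap / total


def contrastive_reward(generated: str, target: str, others: List[str], n: int = 2) -> float:
    """Assumed reward form: score on the target minus the mean score on other candidates."""
    pos = reconstruction_score(generated, target, n)
    if not others:
        return pos
    neg = sum(reconstruction_score(generated, o, n) for o in others) / len(others)
    return pos - neg
```

In an RL loop, a reward of this kind would be computed on each rollout of the finetuned policy. After training, candidates whose reconstruction score rises the most would be ranked as more likely members. The contrastive term is intended to discount generic text that overlaps with every candidate equally.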