This paper introduces DUAL, a benchmark of 28.6k Wikidata triplets annotated with two fact-popularity metrics (Wikipedia link counts and LLM-based salience scores), to investigate machine unlearning in LLMs. The study finds that unlearning behaves differently depending on whether the knowledge originates from pretraining or supervised fine-tuning (SFT). Its key finding is that an SFT step on the forget data yields smoother forgetting, more stable tuning, and higher retention than direct unlearning on pretrained models, which is prone to instability, relearning, and catastrophic forgetting.
Unlearning is much easier on supervised fine-tuned models than on pretrained ones, with direct unlearning on pretrained models often leading to catastrophic forgetting.
Machine Unlearning (MU) enables Large Language Models (LLMs) to remove unsafe or outdated information. However, existing work assumes that all facts are equally forgettable and largely ignores whether the forgotten knowledge originates from pretraining or supervised fine-tuning (SFT). In this paper, we introduce DUAL (Dual Unlearning Evaluation across Training Stages), a benchmark of 28.6k Wikidata-derived triplets annotated with fact popularity using Wikipedia link counts and LLM-based salience scores. Our experiments show that pretrained and SFT models respond differently to unlearning. An SFT step on the forget data yields smoother forgetting, more stable tuning, and 10-50% higher retention, while direct unlearning on pretrained models remains unstable and prone to relearning or catastrophic forgetting.
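To make the annotation scheme described above concrete, here is a minimal Python sketch of how a popularity-annotated triplet might be represented and bucketed. The schema is hypothetical: the field names, the assumed 0 to 1 salience range, the median-based split, and all example values are illustrative assumptions, not the actual DUAL format or data.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Triplet:
    """One Wikidata-style fact with popularity annotations (hypothetical schema)."""
    subject: str
    relation: str
    obj: str
    link_count: int   # Wikipedia link count (which page it is counted for is an assumption)
    salience: float   # LLM-based salience score (assumed range: 0.0 to 1.0)


def split_by_popularity(triplets):
    """Split facts into (popular, rare) around the median link count.

    The median cutoff is an illustrative choice, not the paper's method.
    """
    cutoff = median(t.link_count for t in triplets)
    popular = [t for t in triplets if t.link_count > cutoff]
    rare = [t for t in triplets if t.link_count <= cutoff]
    return popular, rare


# Made-up values for illustration only; none are taken from DUAL.
facts = [
    Triplet("France", "capital", "Paris", link_count=120_000, salience=0.97),
    Triplet("Example Town", "located in", "Example County", link_count=12, salience=0.08),
    Triplet("Example Author", "place of birth", "Example City", link_count=340, salience=0.31),
]
popular, rare = split_by_popularity(facts)
```

Grouping the forget set this way would let an evaluation report forgetting and retention separately for popular and rare facts, which is the kind of comparison the popularity annotations are meant to support.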