Apr 7, 2026 | arXiv:2604.06154

Exclusive Unlearning

Mutsumi Sasaki, Kouta Nakayama, Yusuke Miyao, Yohei Oseki

AI Summary

The paper introduces Exclusive Unlearning (EU), a novel approach to mitigate harmful content generation in LLMs by selectively retaining beneficial knowledge and expressions while extensively forgetting everything else. EU is implemented by training the model to predict masked tokens from a dataset of safe and domain-specific content, effectively overwriting harmful associations learned during pre-training. Experiments show that EU significantly improves LLM safety against jailbreaks and harmful content generation, while preserving performance on target domains like medicine and mathematics.

Key Contribution

Forget everything except what you need: a new unlearning method removes harmful LLM behaviors by preserving only the knowledge and expressions you choose to retain.

Abstract

When introducing Large Language Models (LLMs) into industrial applications, such as healthcare and education, the risk of generating harmful content becomes a significant challenge. While existing machine unlearning methods can erase specific harmful knowledge and expressions, diverse harmful content makes comprehensive removal difficult. In this study, instead of individually listing targets for forgetting, we propose Exclusive Unlearning (EU), which aims for broad harm removal by extensively forgetting everything except for the knowledge and expressions we wish to retain. We demonstrate that through Exclusive Unlearning, it is possible to obtain a model that ensures safety against a wide range of inputs, including jailbreaks, while maintaining the ability to respond to diverse instructions related to specific domains such as medicine and mathematics.

Constitutional AI & AI Ethics | Natural Language Processing | Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
