NJU | Feb 22, 2026 | arXiv:2602.19275

KUDA: Knowledge Unlearning by Deviating Representation for Large Language Models

Ce Fang, Zhikun Zhang, Min Chen, Qing Liu, Lu Zhou, Yunjun Gao

AI Summary

The paper introduces Knowledge Unlearning by Deviating representAtion (KUDA), a novel approach for removing specific knowledge from large language models (LLMs). KUDA uses causal tracing to identify knowledge-storing layers and then employs a new unlearning objective that deviates the model's representations from their original positions, disrupting associations with the target knowledge. To mitigate the impact on retained knowledge, a relaxation null-space projection mechanism is used, and experiments on WMDP and MUSE benchmarks demonstrate KUDA's superior performance in balancing knowledge removal and model utility.

Key Contribution

KUDA lets LLMs forget targeted knowledge while largely preserving overall model utility, using a representation-deviation objective that disrupts the model's associations with that knowledge.

Abstract

Large language models (LLMs) acquire a large amount of knowledge through pre-training on vast and diverse corpora. While this endows LLMs with strong capabilities in generation and reasoning, it amplifies risks associated with sensitive, copyrighted, or harmful content in the training data. LLM unlearning, which aims to remove specific knowledge encoded within models, is a promising technique to reduce these risks. However, existing LLM unlearning methods often force LLMs to generate random or incoherent answers because they cannot alter the encoded knowledge precisely. To achieve effective unlearning at the knowledge level of LLMs, we propose Knowledge Unlearning by Deviating representAtion (KUDA). We first use causal tracing to locate the specific layers that store the target knowledge. We then design a new unlearning objective that induces the model's representations to deviate from their original positions during knowledge removal, thus disrupting the model's ability to associate with the target knowledge. To resolve the optimization conflicts between forgetting and retention, we employ a relaxation null-space projection mechanism to mitigate the disruption to the representation space of retained knowledge. Extensive experiments on two representative benchmarks, WMDP and MUSE, demonstrate that KUDA outperforms most existing baselines by effectively balancing knowledge removal and model utility retention.
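The null-space projection idea mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it shows only the generic, strict form of the technique, in which a weight update is projected so that it has no effect on a set of retained input representations. The names `retain_keys`, `update`, and `project_to_null_space` are hypothetical, and the "relaxation" that KUDA applies is not specified in the abstract, so it is not modeled here.

```python
import math

def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: return an orthonormal basis for span(vectors)."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis

def project_to_null_space(update, retain_keys):
    """Remove from each row of `update` its component in span(retain_keys).

    After projection, update @ k == 0 for every retained key k, so the
    layer's output on retained inputs is unchanged by the update.
    """
    basis = orthonormal_basis(retain_keys)
    projected = []
    for row in update:
        r = list(row)
        for b in basis:
            coeff = sum(ri * bi for ri, bi in zip(r, b))
            r = [ri - coeff * bi for ri, bi in zip(r, b)]
        projected.append(r)
    return projected

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

# Hypothetical toy data: two retained representations in a 3-d space,
# and a 2x3 weight update produced by some forgetting objective.
retain_keys = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
update = [[0.5, -0.2, 0.3], [0.1, 0.4, -0.7]]

safe_update = project_to_null_space(update, retain_keys)
```

In this toy case the retained representations span the first two coordinates, so only the third column of the update survives. A relaxed variant would presumably tolerate a small, controlled effect on retained representations instead of forcing it to exactly zero, but that detail is an assumption and is not taken from the abstract.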

Topics: Constitutional AI & AI Ethics; Natural Language Processing; Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
