Feb 26, 2026 · arXiv:2602.22562

Layer-Targeted Multilingual Knowledge Erasure in Large Language Models

Taoran Li, Varun Chandrasekaran, Zhiyuan Yu

AI Summary

The paper investigates why multilingual knowledge erasure fails in LLMs and identifies intervention depth as the critical factor. The authors show that shallow-layer interventions erase knowledge but collapse multilingual capabilities in held-out languages, while deep-layer interventions preserve utility but fail to erase the target knowledge. To address this, they propose MUTE, a framework that uses CKA and LRDS to identify intermediate, language-agnostic layers and restricts unlearning updates to those layers.

Key Contribution

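Unlearning in LLMs does not transfer across languages by default: knowledge erased in one language often remains accessible through others. Editing shallow layers breaks multilingual capabilities and editing deep layers fails to erase the knowledge, so unlearning updates must target the intermediate, language-agnostic layers.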

Abstract

Recent work has demonstrated that machine unlearning in Large Language Models (LLMs) fails to generalize across languages: knowledge erased in one language frequently remains accessible through others. However, the underlying cause of this failure and a principled solution remain open. In this work, we identify intervention depth as the key factor determining multilingual generalization. Through systematic layer-wise experiments, we characterize two distinct failure modes: shallow-layer interventions achieve erasure but collapse multilingual capabilities in held-out languages, while deep-layer interventions preserve utility but fail to erase target knowledge even in source languages. These findings reveal that the choice of intervention layer is not a free parameter; it fundamentally determines whether multilingual unlearning succeeds. We propose MUTE (Multilingual Unlearning via Targeted Erasure), a framework that uses Centered Kernel Alignment (CKA) and Linguistic Regions Development Score (LRDS) to identify intermediate, language-agnostic layers where cross-lingual representations converge. By restricting unlearning updates to these layers, MUTE achieves robust multilingual knowledge erasure while optimizing on only a small set of source languages. Extensive experiments across three LLM architectures and three unlearning algorithms validate our approach, with mechanistic analysis via Logit Lens probing confirming genuine knowledge removal rather than output-level suppression.
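
The abstract names Centered Kernel Alignment (CKA) as one of the two signals used to locate layers where cross-lingual representations converge. As a rough illustration only, the sketch below computes the standard linear form of CKA between per-layer activations for parallel sentences in two languages and reports the layer with the highest similarity. The function names, the choice of linear CKA, and the "pick the argmax" rule are assumptions made for this example; the abstract does not specify the exact CKA variant or how it is combined with LRDS, and LRDS is not sketched here.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two activation matrices of shape (n_examples, dim).

    Rows of x and y must correspond to the same examples
    (e.g. the same sentence rendered in two languages).
    """
    # Center each feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def rank_layers_by_cka(acts_lang_a, acts_lang_b):
    """Score every layer and return (scores, index of the most aligned layer).

    acts_lang_a / acts_lang_b: lists with one (n_examples, dim) array per layer.
    Hypothetical helper: taking the argmax is an assumption for this sketch,
    not the selection rule used by MUTE.
    """
    scores = [linear_cka(a, b) for a, b in zip(acts_lang_a, acts_lang_b)]
    return scores, int(np.argmax(scores))


# Synthetic stand-in for real model activations: three "layers", where only
# the middle one carries the same representation in both languages
# (up to a rotation and rescaling, which linear CKA ignores).
rng = np.random.default_rng(0)
n_examples, dim = 50, 8
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))

acts_a = [rng.normal(size=(n_examples, dim)) for _ in range(3)]
acts_b = [
    rng.normal(size=(n_examples, dim)),   # unrelated to acts_a[0]
    3.0 * acts_a[1] @ rotation,           # same content, different basis
    rng.normal(size=(n_examples, dim)),   # unrelated to acts_a[2]
]

scores, best_layer = rank_layers_by_cka(acts_a, acts_b)
print("per-layer CKA:", [round(s, 3) for s in scores])
print("most language-agnostic layer (by this sketch):", best_layer)
```

With real models, the per-layer arrays would presumably come from hidden states pooled over tokens for parallel sentences; that extraction step is omitted here because the abstract does not describe it.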

Interpretability & Mechanistic Interp · Natural Language Processing · Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 43

Year: 2026

Venue: N/A
