May 23, 2026 | arXiv:2605.24614

Measuring the Depth of LLM Unlearning via Activation Patching

AI Summary

This paper introduces the Unlearning Depth Score (UDS), a metric that uses activation patching to quantify how deeply target knowledge has been erased from a large language model. In a meta-evaluation of 20 metrics on 150 unlearned models spanning eight unlearning methods, UDS achieves the highest faithfulness and robustness. Case studies show that white-box metrics can disagree at the layer level and that erasure depth varies across examples.

Key Contribution

UDS shows that output-level metrics can miss target knowledge that remains recoverable from a model's internal representations after unlearning.

Abstract

Large language model (LLM) unlearning has emerged as a crucial post-hoc mechanism for privacy protection and AI safety, yet auditing whether target knowledge is truly erased remains challenging. Existing output-level metrics fail to detect when this knowledge remains recoverable from internal representations. Recent white-box studies reveal such residual knowledge but often rely on auxiliary training or dataset-specific adaptations, leaving no generalizable metric. To address these limitations, we propose the Unlearning Depth Score (UDS), a metric that quantifies the mechanistic depth of unlearning via activation patching. UDS first identifies layers that encode the target knowledge using a retain model baseline, then measures how much of it is erased in the unlearned model on a 0-1 scale. In a meta-evaluation across 20 metrics on 150 unlearned models spanning 8 methods, UDS achieves the highest faithfulness and robustness, confirming our causal approach as the most reliable for unlearning evaluation. Case studies further reveal that white-box metrics can disagree at the layer level and that erasure depth varies across examples. We provide guidelines for integrating UDS into existing benchmarking frameworks and streamlining the evaluation pipeline. Code and data are available at https://github.com/gnueaj/unlearning-depth-score
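The abstract describes UDS as a two-step procedure (find the layers that encode the target knowledge using a retain-model baseline, then score how much of that knowledge the unlearned model has erased on a 0-1 scale) but does not state the formula. The following is a minimal sketch of one way such a score could be aggregated, not the authors' implementation. It assumes the activation-patching step has already produced, for each layer, a drop in target-token log-probability when that layer's activations are patched in from the retain model and from the unlearned model. The function name `unlearning_depth_sketch`, the `threshold` parameter, and the weighting scheme are all hypothetical.

```python
from typing import Optional, Sequence


def unlearning_depth_sketch(
    retain_drop: Sequence[float],
    unlearned_drop: Sequence[float],
    threshold: float = 0.05,
) -> Optional[float]:
    """Hypothetical aggregation of per-layer patching effects into a 0-1 score.

    retain_drop[l]    : assumed log-prob drop when layer l is patched from the
                        retain model (a model that never saw the target data).
    unlearned_drop[l] : assumed log-prob drop when layer l is patched from the
                        unlearned model.
    threshold         : hypothetical cutoff; layers whose retain-model drop is
                        at or below it are treated as not encoding the knowledge.
    """
    if len(retain_drop) != len(unlearned_drop):
        raise ValueError("both inputs must have one value per layer")

    weighted_erasure = 0.0
    total_weight = 0.0
    for r, u in zip(retain_drop, unlearned_drop):
        if r <= threshold:
            continue  # layer does not appear to encode the target knowledge
        # Fraction of the retain-model effect reproduced by the unlearned
        # model, clipped so each layer contributes a value in [0, 1].
        erased = min(max(u / r, 0.0), 1.0)
        weighted_erasure += r * erased
        total_weight += r

    # No knowledge-encoding layer was found, so the score is undefined.
    if total_weight == 0.0:
        return None
    return weighted_erasure / total_weight


if __name__ == "__main__":
    # Made-up illustrative values for a three-layer model.
    print(unlearning_depth_sketch([0.0, 1.0, 2.0], [0.0, 0.5, 2.0]))
```

Under these assumptions, a score near 1 means the unlearned model behaves like the retain model at every knowledge-encoding layer, and a score near 0 means the knowledge is still fully recoverable there. Weighting each layer by its retain-model drop is a design choice of this sketch, made so that layers carrying more of the knowledge count for more.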

Constitutional AI & AI Ethics

Interpretability & Mechanistic Interp

Citation Metrics

Citations: 0

Influential citations: 0

References: 27

Year: 2026

Venue: N/A
