JHU | Apr 16, 2026 | arXiv:2604.15166

Class Unlearning via Depth-Aware Removal of Forget-Specific Directions

Arman Hatami, Romina Aalishah, I. Monosov, Ilya E. Monosov

AI Summary

This paper introduces DAMP, a weight-surgery method for class unlearning that removes forget-specific directions from a pretrained network in a single shot. DAMP computes class prototypes at each layer, extracts forget directions as residuals relative to retain-class prototypes, and applies a projection-based update with depth-aware scaling. Experiments on multiple datasets and architectures show DAMP achieves better selective forgetting, preserves retain-class performance, and reduces residual forget-class structure compared to prior methods, more closely resembling retraining.

Key Contribution

Existing class unlearning methods often just suppress the classifier head, but DAMP actually removes the forgotten knowledge from internal representations by surgically editing network weights.

Abstract

Machine unlearning aims to remove targeted knowledge from a trained model without the cost of retraining from scratch. In class unlearning, however, reducing accuracy on forget classes does not necessarily imply true forgetting: forgotten information can remain encoded in internal representations, and apparent forgetting may arise from classifier-head suppression rather than representational removal. We show that existing class-unlearning methods often exhibit weak or negative selectivity, preserve forget-class structure in deep representations, or rely heavily on final-layer bias shifts. We then introduce DAMP (Depth-Aware Modulation by Projection), a one-shot, closed-form weight-surgery method that removes forget-specific directions from a pretrained network without gradient-based optimization. At each stage, DAMP computes class prototypes in the input space of the next learnable operator, extracts forget directions as residuals relative to retain-class prototypes, and applies a projection-based update to reduce downstream sensitivity to those directions. To preserve utility, DAMP uses a parameter-free depth-aware scaling rule derived from probe separability, applying smaller edits in early layers and larger edits in deeper layers. The method naturally extends to multi-class forgetting through low-rank subspace removal. Across MNIST, CIFAR-10, CIFAR-100, and Tiny ImageNet, and across convolutional and transformer architectures, DAMP more closely resembles the retraining gold standard than some of the prior methods, improving selective forgetting while better preserving retain-class performance and reducing residual forget-class structure in deep layers.
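The prototype, residual, and projection steps described in the abstract can be illustrated with a minimal NumPy sketch for a single linear layer. This is not the authors' implementation: the abstract does not give the exact formulas, so the sketch assumes that a forget direction is the least-squares residual of the forget-class prototype against the retain-class prototypes, and that the update right-multiplies the weight matrix by a scaled projector. The function names and the `alpha` parameter are hypothetical; `alpha` only stands in for the depth-aware scale, whose probe-based rule is not reproduced here.

```python
import numpy as np

def class_prototypes(features, labels):
    """Mean feature vector per class, keyed by integer class label."""
    return {int(c): features[labels == c].mean(axis=0) for c in np.unique(labels)}

def forget_direction(prototypes, forget_class):
    """Unit residual of the forget prototype after fitting it with the
    retain prototypes (assumed reading of 'residual relative to retain')."""
    retain = np.stack([p for c, p in prototypes.items() if c != forget_class])
    f = prototypes[forget_class]
    # Least-squares coefficients that best reconstruct f from retain prototypes.
    coef, *_ = np.linalg.lstsq(retain.T, f, rcond=None)
    residual = f - retain.T @ coef
    return residual / np.linalg.norm(residual)

def project_out(W, directions, alpha):
    """Return W (I - alpha * Q Q^T), where Q is an orthonormal basis of the
    given directions. W has shape (out, in); directions has shape (in,) or
    (k, in). alpha in [0, 1] is a placeholder for the depth-aware scale."""
    Q, _ = np.linalg.qr(np.atleast_2d(directions).T)  # (in, k)
    return W - alpha * (W @ Q) @ Q.T

# Toy data: 3 classes in 8 dimensions; class 2 is shifted along axis 0.
rng = np.random.default_rng(0)
features = rng.normal(size=(300, 8))
labels = np.repeat(np.arange(3), 100)
features[labels == 2, 0] += 5.0

protos = class_prototypes(features, labels)
u = forget_direction(protos, forget_class=2)

W = rng.normal(size=(4, 8))               # weights of the next linear operator
W_edit = project_out(W, u, alpha=1.0)     # full removal of the direction
```

With `alpha=1.0` the edited layer no longer responds to the forget direction, while inputs orthogonal to it pass through unchanged. Passing several directions stacked as rows to `project_out` removes their span, which corresponds to the low-rank subspace removal the abstract mentions for multi-class forgetting.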

Interpretability & Mechanistic Interp | Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 37

Year: 2026

Venue: N/A
