Given a dataset $\mathcal{D}=\{(x_{i},y_{i})\}_{i=1}^{N}$, where $y_{i}$ belongs to the label space $\mathcal{Y}=\{1,\cdots,K\}$, the objective of machine learning can be formalized as:

$$\theta=\arg\min_{\theta}\sum_{(x_{i},y_{i})\in\mathcal{D}}\mathcal{L}(f_{\theta}(x_{i}),y_{i}) \qquad (1)$$

where $f_{\theta}$ is the DNN model parameterized by $\theta$, and $\mathcal{L}$ denotes the training loss function. In machine unlearning, the goal is to remove the influence of a designated forget set $\mathcal{D}_{f}\subset\mathcal{D}$ from the pre-trained model $f_{\theta}$ while preserving its performance on the retain set $\mathcal{D}_{r}=\mathcal{D}\setminus\mathcal{D}_{f}$. This process yields the unlearned model $f_{\theta'}$, where $\theta'$ denotes the updated parameters.

Building upon prior studies (Alzubaidi et al., 2021; Allen-Zhu and Li, 2023), deep learning models are widely recognized to process and interpret inputs in a hierarchical manner: progressively deeper layers encode increasingly abstract and semantically rich representations. As network depth increases, model parameters capture more complex feature compositions that correspond to higher-level semantic concepts. Motivated by this perspective, we hypothesize that during pre-training the model learns a series of feature patterns and their mappings to semantic concepts. This process can be formalized as $\mathcal{P}_{k}\mapsto\mathcal{C}_{k},\ k\in[1,K]$, where $\mathcal{P}_{k}=\{p_{k_{1}},\cdots,p_{k_{M}}\}$ denotes the feature patterns of the semantic concept $\mathcal{C}_{k}$ of class $k$. As shown in Figure 1, different concepts may share similar feature patterns, which adds to the complexity of the unlearning process. We therefore make these connections among concepts explicit.
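As a concrete toy illustration of the setup above, the forget/retain partition $\mathcal{D}_{r}=\mathcal{D}\setminus\mathcal{D}_{f}$ for class-level forgetting can be sketched as follows; the dataset and labels are hypothetical, not from the paper's experiments:

```python
# Toy sketch of the unlearning data split: D_f is the forget set
# (here: all samples of designated forget classes), D_r is everything else.
def split_forget_retain(dataset, forget_classes):
    """dataset: list of (x, y) pairs; forget_classes: set of labels to forget."""
    d_f = [(x, y) for (x, y) in dataset if y in forget_classes]
    d_r = [(x, y) for (x, y) in dataset if y not in forget_classes]
    return d_f, d_r

# Hypothetical mini-dataset with labels in {0, 1, 2}; forget class 1.
D = [([0.1], 0), ([0.2], 1), ([0.3], 2), ([0.4], 1)]
D_f, D_r = split_forget_retain(D, forget_classes={1})
```

The unlearned model $f_{\theta'}$ is then obtained by fine-tuning on signals derived from `D_f` and `D_r` rather than by retraining from scratch.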
For two distinct concepts $\mathcal{C}_{i}$ and $\mathcal{C}_{j}$, we define the intersection of their feature patterns, $\mathcal{P}^{ass}_{ij}=\mathcal{P}_{i}\cap\mathcal{P}_{j}$, as the associated features, while the unique features of $\mathcal{C}_{i}$ and $\mathcal{C}_{j}$ are defined as $\mathcal{P}^{uni}_{i}=\mathcal{P}_{i}\setminus\mathcal{P}^{ass}_{ij}$ and $\mathcal{P}^{uni}_{j}=\mathcal{P}_{j}\setminus\mathcal{P}^{ass}_{ij}$, respectively.

Unlearning via training with error-maximizing noise. The prior method UNSIR seeks to achieve unlearning by deliberately inducing catastrophic forgetting of the designated target data, through a two-step paradigm:

$$\theta'_{impair}=\arg\min_{\theta'}\sum_{(x_{i},y_{i})\in\mathcal{D}_{r}}\mathcal{L}(f_{\theta}(x_{i}+\mathcal{N}_{f}),y_{i})$$
$$\theta'_{repair}=\arg\min_{\theta'}\sum_{(x_{i},y_{i})\in\mathcal{D}_{r}}\mathcal{L}(f_{\theta}(x_{i}),y_{i}) \qquad (2)$$

where $\mathcal{N}_{f}$ denotes the error-maximizing noise: a matrix of the same size as the model input, randomly initialized and then optimized against the frozen pretrained model via:

$$\mathcal{N}_{f}=\arg\min_{\mathcal{N}}\mathbb{E}\left(-\mathcal{L}(f_{\theta}(\mathcal{N}),y^{p}_{i})+\lambda\|w_{noise}\|\right) \qquad (3)$$

In this manner, $\mathcal{N}_{f}$ extracts counteracting feature information from the frozen pretrained model that is semantically opposite to the target data. Under the classification loss $\mathcal{L}$, these anti-feature representations induce gradient update directions that oppose those driven by the target data.
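The noise optimization in Equation 3 can be sketched on a toy frozen model. Here a fixed softmax-regression layer $(W, b)$ stands in for $f_{\theta}$, cross-entropy for $\mathcal{L}$, and plain gradient ascent on the loss replaces whatever optimizer UNSIR actually uses; all hyperparameters are illustrative, not the paper's settings:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                              # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

def learn_error_max_noise(W, b, y_f, dim, lam=0.01, lr=0.5, steps=200, seed=0):
    """Learn input-shaped noise N that MAXIMIZES the frozen model's
    cross-entropy loss for the forget class y_f, with an L2 penalty on N
    standing in for the lambda * ||w_noise|| regularizer of Eq. (3)."""
    rng = np.random.default_rng(seed)
    N = rng.normal(scale=0.1, size=dim)          # randomly initialized noise
    for _ in range(steps):
        p = softmax(W @ N + b)                   # frozen model's prediction
        onehot = np.zeros_like(p)
        onehot[y_f] = 1.0
        grad_ce = W.T @ (p - onehot)             # d(cross-entropy)/d(input)
        N += lr * grad_ce - lr * lam * N         # ascend the loss, shrink the norm
    return N
```

After optimization, the frozen model assigns the forget class a near-zero probability on input $\mathcal{N}_{f}$, which is the "semantically opposite" behavior exploited in the impair stage.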
Consequently, during the impair stage in Equation 2, when $\mathcal{N}_{f}$ is incorporated together with the retain data $\mathcal{D}_{r}$ and optimized with respect to $y_{i}$, the model's originally well-aligned optimization trajectory for target-related features is deliberately disrupted. As a result, the model's ability to accurately capture and exploit features associated with the target data is significantly degraded, effectively realizing concept-level forgetting of the target data.

However, this approach does not account for feature entanglement across different semantic concepts. The error-maximizing noise indiscriminately extracts all anti-feature representations associated with the target data; directly fine-tuning on these representations without proper selection or disentanglement risks disrupting the model's understanding of features relevant to the retained data. Moreover, this strategy operates solely at the level of feature representations. Because the target data's semantic concepts are never explicitly manipulated during the unlearning stage, the mapping from feature patterns to the target semantic concept may persist in the model's latent space, ultimately resulting in incomplete and insufficient forgetting.

4 Methodology

In this section, we introduce MeGU for effective machine unlearning, which combines two components: 1) MLLM-guided label perturbation and 2) the Fragment-Align strategy. As shown in Figure 2, we first estimate inter-concept similarities using a zero-shot MLLM on a subset of the training data and cache them in a lightweight transition matrix. Based on the pretrained model's predictions, the transition matrix then assigns a suitable label perturbation to manipulate concepts. For each forgetting instance, pre-trained and cached feature noises are injected to disentangle their influence.
Furthermore, we utilize MLLM guidance to quantify inter-concept distances, ensuring effective label perturbation and identification of the class for positive noise, thereby preserving the retained data. In the following sections, we first describe the generation of perturbing labels in Section 4.1. Then, in Section 4.2, we elucidate the Fragment-Align strategy for influence disentanglement and the fine-tuning method used to unlearn the target data.

4.1 Perturbing labels generation

Intuitively, to induce unlearning, MeGU introduces semantically consistent but incorrect labels at the fine-tuning stage, replacing the learned connections between target data and their correct labels with a meaningful alternative relation to the retained concepts. To achieve this, we leverage an MLLM to estimate the semantic consistency among concepts and assign suitable perturbing labels. To reduce computation costs, we use a lightweight transition matrix to capture inter-concept similarities derived by the MLLM from a small subset of the training data. This matrix, which encodes the MLLM's knowledge, can be reused repeatedly for perturbing-label assignment without further MLLM calls.

Transition matrix estimation. Specifically, let $q_{w}$ be the MLLM parameterized by $w$. We randomly select $n$ exemplars from each class in the dataset $\mathcal{D}$ to construct the subset $\mathcal{D}_{ex}$. We then prompt $q_{w}$ to estimate the feature similarity of each instance in $\mathcal{D}_{ex}$ to all other semantic concepts. For example, given a query image from class $k$ and a candidate class $l\in[1,K]$, the prompt is:

Question: This image <IMG$_{l}$> shows a photo of <label$_{l}$>, True or False? Answer: True; Question: This image <IMG$_{p}$> shows a photo of <label$_{l}$>, True or False? Answer: False; Question: This image <IMG$^{query}_{k}$> shows a photo of <label$_{l}$>, True or False? Answer:
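Assuming a per-class similarity score can be extracted from the True/False answers (e.g., the model's probability of answering "True"), the transition-matrix caching and perturbing-label assignment could be sketched as follows. The `query_mllm` scorer is a hypothetical stand-in for the actual prompting of $q_{w}$, and the averaging scheme is one plausible reading of the text, not the paper's exact recipe:

```python
import numpy as np

def build_transition_matrix(exemplars, num_classes, query_mllm):
    """exemplars: list of (image, class_k) pairs from D_ex.
    query_mllm(image, l) -> similarity score in [0, 1] that the image depicts
    class l (placeholder for the True/False MLLM prompt). Returns a K x K
    matrix whose row k averages exemplar scores of class k against each class."""
    T = np.zeros((num_classes, num_classes))
    counts = np.zeros(num_classes)
    for img, k in exemplars:
        for l in range(num_classes):
            T[k, l] += query_mllm(img, l)
        counts[k] += 1
    return T / np.maximum(counts[:, None], 1)    # avoid division by zero

def perturbing_label(T, forget_class):
    """Pick the semantically closest non-target class as the perturbing label."""
    scores = T[forget_class].copy()
    scores[forget_class] = -np.inf               # exclude the concept to forget
    return int(np.argmax(scores))
```

Because `T` is computed once from $\mathcal{D}_{ex}$ and cached, every later call to `perturbing_label` is a cheap table lookup with no further MLLM queries.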