Mar 2, 2026 · arXiv:2603.01938

Explanation-Guided Adversarial Training for Robust and Interpretable Models

Yanhui Chen, Shanshan Lin, Dongsheng Hong, Shu Wu, Xiangwen Liao, Chuanyi Liu

AI Summary

This paper introduces Explanation-Guided Adversarial Training (EGAT), a novel framework that combines adversarial training (AT) with explanation-guided learning (EGL) to improve model robustness, interpretability, and accuracy. EGAT generates adversarial examples while simultaneously enforcing explanation-based constraints, thereby encouraging the model to rely on semantically meaningful features for its decisions. Experiments on out-of-distribution datasets demonstrate that EGAT significantly outperforms existing methods in both clean and adversarial accuracy, while also generating more interpretable explanations with only a modest increase in training time.

Key Contribution

Combines adversarial training with explanation-guided learning to obtain models that are both robust and interpretable, with a reported +37% gain in adversarial accuracy on OOD data.

Abstract

Deep neural networks (DNNs) have achieved remarkable performance in many tasks, yet they often behave as opaque black boxes. Explanation-guided learning (EGL) methods steer DNNs using human-provided explanations or supervision on model attributions. These approaches improve interpretability but typically assume benign inputs and incur heavy annotation costs. Yet both the predictions and the saliency maps of DNNs can change dramatically under imperceptible perturbations or unseen patterns. Adversarial training (AT) can substantially improve robustness, but it does not guarantee that model decisions rely on semantically meaningful features. In response, we propose Explanation-Guided Adversarial Training (EGAT), a unified framework that integrates the strengths of AT and EGL to simultaneously improve prediction performance, robustness, and explanation quality. EGAT generates adversarial examples on the fly while imposing explanation-based constraints on the model. By jointly optimizing classification performance, adversarial robustness, and attributional stability, EGAT is not only more resistant to unexpected cases, including adversarial attacks and out-of-distribution (OOD) scenarios, but also offers human-interpretable justifications for its decisions. We further formalize EGAT within the Probably Approximately Correct (PAC) learning framework, demonstrating theoretically that it yields more stable predictions under unexpected situations than standard AT. Empirical evaluations on OOD benchmark datasets show that EGAT consistently outperforms competitive baselines in both clean accuracy and adversarial accuracy (+37%) while producing more semantically meaningful explanations and requiring only a limited increase (+16%) in training time.
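The abstract describes a joint objective (classification, adversarial robustness, attributional stability) but does not state its form. As a rough illustration only, the sketch below computes one plausible version of such a loss for a tiny logistic-regression model in pure Python. Everything specific here is an assumption rather than the authors' formulation: the FGSM attack, the gradient-times-input attribution, the relevance `mask`, the squared penalties, and the weights `lam_adv`, `lam_expl`, and `lam_stab` are hypothetical choices made for the example.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(w, b, x):
    # Probability of class 1 under a linear logistic model.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def cross_entropy(p, y, tiny=1e-12):
    # Binary cross-entropy for a single example with label y in {0, 1}.
    return -(y * math.log(p + tiny) + (1 - y) * math.log(1 - p + tiny))

def fgsm(w, b, x, y, eps):
    # One-step FGSM attack (an assumed attack, not necessarily the paper's).
    # For logistic regression, d(loss)/d(x_i) = (p - y) * w_i.
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]

def attribution(w, x):
    # Gradient-times-input attribution of the logit (assumed explainer).
    return [wi * xi for wi, xi in zip(w, x)]

def joint_loss(w, b, x, y, mask, eps=0.1,
               lam_adv=1.0, lam_expl=1.0, lam_stab=1.0):
    # mask[i] = 1 marks a feature annotated as meaningful, 0 as irrelevant
    # (hypothetical annotation format).
    x_adv = fgsm(w, b, x, y, eps)
    clean = cross_entropy(predict(w, b, x), y)
    adv = cross_entropy(predict(w, b, x_adv), y)
    a_clean = attribution(w, x)
    a_adv = attribution(w, x_adv)
    # Explanation term: penalize attribution placed on irrelevant features.
    expl = sum((1 - m) * a * a for m, a in zip(mask, a_adv))
    # Stability term: attributions should not move under the perturbation.
    stab = sum((c - d) ** 2 for c, d in zip(a_clean, a_adv))
    total = clean + lam_adv * adv + lam_expl * expl + lam_stab * stab
    return {"clean": clean, "adv": adv, "expl": expl,
            "stab": stab, "total": total}

# Usage: three features, the third one annotated as irrelevant.
parts = joint_loss(w=[2.0, -1.0, 0.5], b=0.0,
                   x=[1.0, 0.5, -1.0], y=1, mask=[1, 1, 0])
```

In a real training loop the adversarial example would be regenerated at every step and `total` minimized with respect to the model parameters; the sketch only evaluates the loss once to show how the three terms could be combined.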

Interpretability & Mechanistic Interp · Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
