TU Munich | Feb 16, 2026 | arXiv:2602.15238

Closing the Distribution Gap in Adversarial Training for LLMs

Chengzhi Hu, Jonas Dornbusch, David Lüdke, Stephan Günnemann, Leo Schwinn

AI Summary

The paper identifies a distribution gap in adversarial training for LLMs: models remain vulnerable to simple in-distribution attacks because training inadequately covers the data distribution. To address this, the authors introduce Distributional Adversarial Training (DAT), which uses Diffusion LLMs to approximate the true joint distribution of prompts and responses and to generate diverse, high-likelihood samples. DAT combines optimization over the data distribution provided by the diffusion model with continuous adversarial training, and the authors report substantially higher adversarial robustness than previous methods.

Key Contribution

LLMs can still be fooled by simple prompt rewrites, such as past-tense rephrasing or translation, because current adversarial training does not adequately cover the data distribution. DAT, a new method that draws training samples from Diffusion LLMs, is proposed to close this gap.

Abstract

Adversarial training for LLMs is one of the most promising methods to reliably improve robustness against adversaries. However, despite significant progress, models remain vulnerable to simple in-distribution exploits, such as rewriting prompts in the past tense or translating them into other languages. We argue that this persistent fragility stems from a fundamental limitation in current adversarial training algorithms: they minimize adversarial loss on their training set but inadequately cover the data distribution, resulting in vulnerability to seemingly simple attacks. To bridge this gap, we propose Distributional Adversarial Training, DAT. We leverage Diffusion LLMs to approximate the true joint distribution of prompts and responses, enabling generation of diverse, high-likelihood samples that address generalization failures. By combining optimization over the data distribution provided by the diffusion model with continuous adversarial training, DAT achieves substantially higher adversarial robustness than previous methods.
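The contrast drawn in the abstract can be written as two objectives. This is a hedged sketch, not the paper's formulation: the symbols below are illustrative assumptions (model $f_\theta$, embedding map $e$, generic loss $\ell$, perturbation budget $\epsilon$, training set $\{(x_i, y_i)\}_{i=1}^{N}$, and a Diffusion LLM $q_\phi$ approximating the joint distribution $p(x, y)$ of prompts and responses). It also assumes that "continuous" adversarial training means perturbing prompts in embedding space; the actual loss used by DAT may contain additional terms.

```latex
% Adversarial training on a fixed training set (empirical adversarial risk):
\min_{\theta} \; \frac{1}{N} \sum_{i=1}^{N}
  \max_{\|\delta\| \le \epsilon} \;
  \ell\bigl(f_\theta(e(x_i) + \delta),\, y_i\bigr)

% Distributional variant: the expectation is taken over samples drawn from
% the diffusion model q_\phi, which approximates p(x, y):
\min_{\theta} \; \mathbb{E}_{(x, y) \sim q_\phi}
  \Bigl[ \max_{\|\delta\| \le \epsilon} \;
  \ell\bigl(f_\theta(e(x) + \delta),\, y\bigr) \Bigr]
```

Under these assumptions, the first objective only controls the adversarial loss at the $N$ training points, whereas the second also penalizes failures on rephrasings or translations that are likely under $q_\phi$ but absent from the training set.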

Natural Language Processing | Red-Teaming & Adversarial Robustness | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
