Apr 7, 2026 · arXiv:2604.06436

The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?

Manish Bhatt, Sarthak Munshi, Vineeth Sai Narajala, I. Habler, Ammar Al-Kahfah, Ken Huang, Blake Gatto

AI Summary

This paper formally proves a "defense trilemma": no continuous, utility-preserving input wrapper can guarantee complete safety for a language model with a connected prompt space. The authors establish three results under successively stronger hypotheses: such a defense must leave some threshold-level inputs unchanged (boundary fixation); under Lipschitz regularity, a positive-measure band around those fixed boundary points remains near-threshold ($\epsilon$-robust constraint); and under a transversality condition, a positive-measure subset of inputs remains strictly unsafe (persistent unsafe region). The theory is mechanically verified in Lean 4 and validated empirically on three LLMs, highlighting fundamental limitations of wrapper-based prompt injection defenses.

Key Contribution

Input wrappers meant to defend against prompt injection are fundamentally limited: on a connected prompt space, no wrapper, however clever, can be continuous, utility-preserving, and completely safe all at once.
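As a rough illustration only, the following toy sketch plays this out on a one-dimensional "prompt space" $X = [0, 1]$. Everything in it is an assumption made for the example and not taken from the paper: the risk score `f`, the threshold `TAU`, the reading of utility preservation as "strictly safe inputs pass through unchanged", and the particular wrapper `defense`. It shows only the first two effects (a fixed boundary point and a near-threshold band), not the persistent unsafe region.

```python
# Toy 1-D illustration; every name here is hypothetical, not from the paper.
TAU = 0.5  # assumed safety threshold: f(x) < TAU counts as strictly safe


def f(x: float) -> float:
    """Assumed continuous risk score on the prompt space X = [0, 1]."""
    return x


def defense(x: float) -> float:
    """A continuous wrapper that leaves strictly safe inputs untouched
    (utility preservation) and reflects unsafe inputs back across TAU."""
    if x < TAU:
        return x
    return max(0.0, 2 * TAU - x)


def near_threshold_fraction(eps: float, n: int = 10_001) -> float:
    """Fraction of grid points whose defended score lies within eps of TAU."""
    grid = [i / (n - 1) for i in range(n)]
    hits = sum(1 for x in grid if abs(f(defense(x)) - TAU) <= eps)
    return hits / n


# Continuity forces the wrapper to agree with the identity at the boundary,
# so the threshold-level input is a fixed point and is not strictly safe.
boundary_score = f(defense(TAU))

# A whole interval of inputs around that fixed point stays near the threshold.
band = near_threshold_fraction(eps=0.05)
print(boundary_score, band)
```

In this sketch the wrapper pushes every unsafe input to the safe side, yet the input at the threshold is still mapped to itself, and roughly a tenth of the interval ends up within 0.05 of the threshold. Making the wrapper discontinuous at `TAU`, or letting it alter safe inputs, would avoid this, which is the trade-off the trilemma describes.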

Abstract

We prove that no continuous, utility-preserving wrapper defense (a function $D: X\to X$ that preprocesses inputs before the model sees them) can make all outputs strictly safe for a language model with connected prompt space, and we characterize exactly where every such defense must fail. We establish three results under successively stronger hypotheses: boundary fixation (the defense must leave some threshold-level inputs unchanged); an $\epsilon$-robust constraint (under Lipschitz regularity, a positive-measure band around fixed boundary points remains near-threshold); and a persistent unsafe region (under a transversality condition, a positive-measure subset of inputs remains strictly unsafe). These constitute a defense trilemma: continuity, utility preservation, and completeness cannot coexist. We prove parallel discrete results requiring no topology, and extend to multi-turn interactions, stochastic defenses, and capacity-parity settings. The results do not preclude training-time alignment, architectural changes, or defenses that sacrifice utility. The full theory is mechanically verified in Lean 4 and validated empirically on three LLMs.
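The boundary-fixation claim can be reconstructed with a standard point-set argument. The sketch below is one plausible route, not necessarily the paper's own proof, and it rests on assumptions the abstract does not spell out: a continuous risk score $f$, a threshold $\tau$, a Hausdorff prompt space, a safe set $S$ that is neither empty nor all of $X$, and utility preservation read as $D(x) = x$ on $S$.

```latex
\begin{align*}
&\textbf{Assume: } X \text{ connected and Hausdorff},\quad
  f\colon X \to \mathbb{R} \text{ continuous},\quad
  S = \{x \in X : f(x) < \tau\},\quad \emptyset \neq S \neq X,\\
&\qquad D\colon X \to X \text{ continuous with } D(x) = x \text{ for all } x \in S.\\[4pt]
&\text{1. } S \text{ is open. Since } X \text{ is connected, } S \text{ is not closed,}
  \text{ so there is some } z \in \overline{S} \setminus S.\\
&\text{2. } f(z) \ge \tau \text{ because } z \notin S,
  \text{ and } f(z) \le \tau \text{ because } z \in \overline{S}
  \text{ and } f \text{ is continuous; hence } f(z) = \tau.\\
&\text{3. } \mathrm{Fix}(D) = \{x : D(x) = x\} \text{ is closed (} X \text{ is Hausdorff) and contains } S,
  \text{ so } z \in \overline{S} \subseteq \mathrm{Fix}(D).\\
&\text{4. } f(D(z)) = f(z) = \tau \not< \tau,
  \text{ so the defended output at } z \text{ is not strictly safe.}
\end{align*}
```

Under these assumptions, step 3 is where continuity and utility preservation combine: the wrapper has no freedom at the boundary of the safe set, whatever it does to inputs deeper in the unsafe region.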

Constitutional AI & AI Ethics · Eval Frameworks & Benchmarks · Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 23

Year: 2026

Venue: N/A
