This paper formally proves a "defense trilemma": a continuous, utility-preserving input wrapper cannot guarantee complete safety for a language model whose prompt space is connected. Three results, proved under successively stronger hypotheses, locate where such a defense fails: some threshold-level inputs are left unchanged (boundary fixation); under Lipschitz regularity, a positive-measure band around those fixed boundary points stays near the threshold ($\epsilon$-robust constraint); and under a transversality condition, a positive-measure set of inputs stays strictly unsafe (persistent unsafe region). The theory is mechanically verified in Lean 4 and validated empirically on three LLMs, highlighting fundamental limitations of wrapper-based prompt injection defenses.
Input wrappers meant to defend against prompt injection are fundamentally limited: you can't have continuity, utility preservation, and complete safety all at once, no matter how clever the wrapper.
We prove that no continuous, utility-preserving wrapper defense (a function $D: X\to X$ that preprocesses inputs before the model sees them) can make all outputs strictly safe for a language model with a connected prompt space, and we characterize exactly where every such defense must fail. We establish three results under successively stronger hypotheses: (i) boundary fixation: the defense must leave some threshold-level inputs unchanged; (ii) an $\epsilon$-robust constraint: under Lipschitz regularity, a positive-measure band around fixed boundary points remains near-threshold; and (iii) a persistent unsafe region: under a transversality condition, a positive-measure subset of inputs remains strictly unsafe. These constitute a defense trilemma: continuity, utility preservation, and completeness cannot coexist. We prove parallel discrete results requiring no topology, and extend to multi-turn interactions, stochastic defenses, and capacity-parity settings. The results do not preclude training-time alignment, architectural changes, or defenses that sacrifice utility. The full theory is mechanically verified in Lean 4 and validated empirically on three LLMs.
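To make the trilemma concrete, here is a minimal one-dimensional sketch. It is not the paper's construction or experiments. It assumes a toy prompt space $X = [0, 1]$, a hypothetical deviation score `f(x) = x`, and a threshold `TAU = 0.5`, so that "strictly safe" means `f(x) < TAU` and "utility-preserving" means safe inputs pass through unchanged. All function names are illustrative, and the grid-based checks are only a numerical proxy for continuity and completeness.

```python
# Toy 1-D illustration of the trilemma. Everything here (X = [0, 1], f, TAU,
# the two defenses) is an illustrative assumption, not taken from the paper.

TAU = 0.5  # hypothetical safety threshold


def f(x: float) -> float:
    """Hypothetical alignment-deviation score, continuous on [0, 1]."""
    return x


def clamp_defense(x: float) -> float:
    """Continuous and utility-preserving, but not complete."""
    return min(x, TAU)


def jump_defense(x: float) -> float:
    """Utility-preserving and complete, but discontinuous at the threshold."""
    return x if f(x) < TAU else 0.4


def grid(n: int = 10_000) -> list[float]:
    """Evenly spaced sample of the toy prompt space."""
    return [i / n for i in range(n + 1)]


def preserves_utility(defense) -> bool:
    """Every strictly safe grid input is returned unchanged."""
    return all(defense(x) == x for x in grid() if f(x) < TAU)


def is_complete(defense) -> bool:
    """Every grid input is mapped to a strictly safe point."""
    return all(f(defense(x)) < TAU for x in grid())


def max_jump(defense) -> float:
    """Largest output change between adjacent grid points.

    A value that stays large as the grid is refined signals a discontinuity.
    """
    xs = grid()
    return max(abs(defense(b) - defense(a)) for a, b in zip(xs, xs[1:]))


if __name__ == "__main__":
    for name, d in [("clamp", clamp_defense), ("jump", jump_defense)]:
        print(name, preserves_utility(d), is_complete(d), max_jump(d))
```

In this toy setting the clamp leaves the threshold input `TAU` unchanged, which mirrors boundary fixation: its output sits exactly at the threshold rather than strictly below it. The jump defense reaches completeness only by giving up continuity.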