The paper demonstrates that encoding harmful prompts as coherent mathematical problems bypasses LLM safety filters at a high rate (46%-56% average attack success across eight target models and two benchmarks). The attack's effectiveness hinges on a helper LLM deeply reformulating the harmful content into a genuine mathematical problem, not on mathematical notation alone: rule-based encodings that only apply mathematical formatting perform no better than unencoded baselines. The authors introduce a novel Formal Logic encoding that achieves attack success comparable to Set Theory, and show that newer models (GPT-5, GPT-5-Mini) are substantially more robust than older ones but remain vulnerable.
LLM safety filters, which rely on semantic pattern matching, can be bypassed at scale by encoding harmful prompts as coherent mathematical problems, revealing a fundamental vulnerability.
Large language models (LLMs) employ safety mechanisms to prevent harmful outputs, yet these defenses primarily rely on semantic pattern matching. We show that encoding harmful prompts as coherent mathematical problems (using formalisms such as set theory, formal logic, and quantum mechanics) bypasses these filters at high rates, achieving 46%-56% average attack success across eight target models and two established benchmarks. Crucially, the effectiveness depends not on mathematical notation itself, but on whether a helper LLM deeply reformulates the harmful content into a genuine mathematical problem: rule-based encodings that apply mathematical formatting without such reformulation perform no better than unencoded baselines. We introduce a novel Formal Logic encoding that achieves attack success comparable to Set Theory, demonstrating that this vulnerability generalizes across mathematical formalisms. Additional experiments with repeat post-processing confirm that these attacks are robust to simple prompt augmentation. Notably, newer models (GPT-5, GPT-5-Mini) show substantially greater robustness than older models, though they remain vulnerable. Our findings highlight fundamental gaps in current safety frameworks and motivate defenses that reason about mathematical structure rather than surface-level semantics.