May 5, 2026 · arXiv:2605.03441

Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis

Haoyu Zhang, Mohammad Zandsalimy, Shanu Sushmita

AI Summary

The paper demonstrates that encoding harmful prompts as coherent mathematical problems bypasses LLM safety filters at high rates (46%-56% average attack success across eight target models and two benchmarks). The attack's effectiveness hinges on a helper LLM deeply reformulating the harmful content into a genuine mathematical problem, not on mathematical notation alone: rule-based encodings that only apply mathematical formatting perform no better than unencoded baselines. The authors introduce a novel Formal Logic encoding that achieves attack success comparable to Set Theory, and show that newer models are more robust but remain vulnerable.

Key Contribution

LLM safety filters, which rely primarily on semantic pattern matching, can be bypassed at high rates by encoding harmful prompts as coherent mathematical problems, which reveals a fundamental gap in current safety frameworks.

Abstract

Large language models (LLMs) employ safety mechanisms to prevent harmful outputs, yet these defenses primarily rely on semantic pattern matching. We show that encoding harmful prompts as coherent mathematical problems (using formalisms such as set theory, formal logic, and quantum mechanics) bypasses these filters at high rates, achieving 46%-56% average attack success across eight target models and two established benchmarks. Crucially, the effectiveness depends not on mathematical notation itself, but on whether a helper LLM deeply reformulates the harmful content into a genuine mathematical problem: rule-based encodings that apply mathematical formatting without such reformulation perform no better than unencoded baselines. We introduce a novel Formal Logic encoding that achieves attack success comparable to Set Theory, demonstrating that this vulnerability generalizes across mathematical formalisms. Additional experiments with repeat post-processing confirm that these attacks are robust to simple prompt augmentation. Notably, newer models (GPT-5, GPT-5-Mini) show substantially greater robustness than older models, though they remain vulnerable. Our findings highlight fundamental gaps in current safety frameworks and motivate defenses that reason about mathematical structure rather than surface-level semantics.

Eval Frameworks & Benchmarks · Reasoning & Chain-of-Thought · Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 21

Year: 2026

Venue: N/A
