This paper investigates the vulnerability of text-to-image models to jailbreak attacks that use only natural language prompts, requiring no model access or adversarial training. The authors introduce a taxonomy of visual jailbreak techniques that exploit weaknesses in prompt moderation and visual safety filtering by masking unsafe intent within benign semantic contexts. Experiments across state-of-the-art text-to-image systems show that simple linguistic modifications can reliably evade existing safeguards, achieving attack success rates of up to 74.47%.
Text-to-image safety filters are surprisingly easy to bypass: simple prompt reframing techniques achieve success rates of up to 74% in generating restricted imagery.
Text-to-image generative models are widely deployed in creative tools and online platforms. To mitigate misuse, these systems rely on safety filters and moderation pipelines that aim to block harmful or policy-violating content. In this work, we show that modern text-to-image models remain vulnerable to low-effort jailbreak attacks that require only natural language prompts. We present a systematic study of prompt-based strategies that bypass safety filters without model access, optimization, or adversarial training. We introduce a taxonomy of visual jailbreak techniques, including artistic reframing, material substitution, pseudo-educational framing, lifestyle aesthetic camouflage, and ambiguous action substitution. These strategies exploit weaknesses in prompt moderation and visual safety filtering by masking unsafe intent within benign semantic contexts. We evaluate these attacks across several state-of-the-art text-to-image systems and demonstrate that simple linguistic modifications can reliably evade existing safeguards and produce restricted imagery. Across all tested models and attack categories, we observe an attack success rate (ASR) of up to 74.47%. Our findings highlight a critical gap between surface-level prompt filtering and the semantic understanding required to detect adversarial intent in generative media systems.