Apr 28, 2026 | arXiv:2604.25102

One Perturbation, Two Failure Modes: Probing VLM Safety via Embedding-Guided Typographic Perturbations

Ravikumar Balakrishnan, Sanket Mendapara

AI Summary

This paper investigates typographic prompt injection attacks on VLMs and finds that attack success rate is strongly negatively correlated with the multimodal embedding distance between the image and the text it renders. The authors show that reducing this distance via bounded $\ell_\infty$ perturbations increases attack success through two effects: improved readability and fewer safety-aligned refusals. Which effect dominates depends on the VLM's safety filter strength and the degree of visual degradation.

Key Contribution

Raising the embedding similarity between a typographic prompt image and its text does more than improve readability for VLMs: the same perturbation also reduces safety-aligned refusals, so a single optimization exposes both failure modes.

Abstract

Typographic prompt injection exploits vision language models' (VLMs) ability to read text rendered in images, posing a growing threat as VLMs power autonomous agents. Prior work typically focuses on maximizing attack success rate (ASR) but does not explain why certain renderings bypass safety alignment. We make two contributions. First, an empirical study across four VLMs including GPT-4o and Claude, twelve font sizes, and ten transformations reveals that multimodal embedding distance strongly predicts ASR ($r = -0.71$ to $-0.93$, $p < 0.01$), providing an interpretable, model-agnostic proxy. Since embedding distance predicts ASR, reducing it should improve attack success, but the relationship is mediated by two factors: perceptual readability (whether the VLM can parse the text) and safety alignment (whether it refuses to comply). Second, we use this as a red-teaming tool: we directly maximize image-text embedding similarity under bounded $\ell_\infty$ perturbations via CWA-SSA across four surrogate embedding models, stress-testing both factors without access to the target model. Experiments across five degradation settings on GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL confirm that optimization recovers readability and reduces safety-aligned refusals as two co-occurring effects, with the dominant mechanism depending on the model's safety filter strength and the degree of visual degradation.
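The second contribution can be pictured with a minimal sketch: signed-gradient ascent on the mean image-text cosine similarity across several surrogates, with each iterate projected back into an $\ell_\infty$ ball around the original image. This is not the paper's method. The paper uses CWA-SSA with real surrogate embedding models, while the sketch substitutes plain gradient averaging and toy linear encoders. Every name here (`embed`, `linf_similarity_ascent`, `make_toy_surrogates`) and every setting (`eps=8/255`, `step=1/255`, 50 iterations, 16 "pixels") is a hypothetical choice for illustration.

```python
import math
import random


def embed(W, x):
    """Toy linear image encoder: one output coordinate per row of W."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cosine_grad(W, x, t):
    """Gradient of cosine(embed(W, x), t) with respect to the pixels x."""
    z = embed(W, x)
    nz = math.sqrt(sum(a * a for a in z))
    nt = math.sqrt(sum(b * b for b in t))
    c = cosine(z, t)
    # d cos / d z = t / (|z||t|) - cos * z / |z|^2
    gz = [tj / (nz * nt) - c * zj / (nz * nz) for zj, tj in zip(z, t)]
    # Chain rule through the linear map z = W x gives W^T gz.
    return [sum(W[j][i] * gz[j] for j in range(len(W))) for i in range(len(x))]


def mean_similarity(surrogates, x):
    return sum(cosine(embed(W, x), t) for W, t in surrogates) / len(surrogates)


def linf_similarity_ascent(surrogates, x0, eps=8 / 255, step=1 / 255, iters=50):
    """Signed-gradient ascent on mean cosine similarity, kept within eps of x0."""
    x, best, best_sim = list(x0), list(x0), mean_similarity(surrogates, x0)
    for _ in range(iters):
        grad = [0.0] * len(x)
        for W, t in surrogates:  # plain ensemble average (assumption), not CWA-SSA
            for i, g in enumerate(cosine_grad(W, x, t)):
                grad[i] += g / len(surrogates)
        stepped = [xi + step * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]
        # Project onto the l_inf ball around x0, then onto the valid pixel range [0, 1].
        x = [min(1.0, max(0.0, min(o + eps, max(o - eps, s)))) for s, o in zip(stepped, x0)]
        sim = mean_similarity(surrogates, x)
        if sim > best_sim:  # keep the best iterate seen so far
            best, best_sim = list(x), sim
    return best


def make_toy_surrogates(n_models=4, dim_in=16, dim_out=8, seed=0):
    """Random (image-encoder matrix, text embedding) pairs standing in for real models."""
    rng = random.Random(seed)
    return [
        ([[rng.gauss(0, 1) for _ in range(dim_in)] for _ in range(dim_out)],
         [rng.gauss(0, 1) for _ in range(dim_out)])
        for _ in range(n_models)
    ]


surrogates = make_toy_surrogates()
rng = random.Random(1)
x0 = [rng.uniform(0.2, 0.8) for _ in range(16)]  # stand-in for a rendered text image
x_adv = linf_similarity_ascent(surrogates, x0)
print(mean_similarity(surrogates, x0), mean_similarity(surrogates, x_adv))
```

The projection step is what enforces the perturbation bound: whatever the gradient suggests, no pixel moves more than `eps` from its original value. In a real setting the toy encoders would be replaced by the image and text towers of actual multimodal embedding models, with gradients obtained by automatic differentiation.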

Constitutional AI & AI Ethics | Multimodal Models | Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 10

Year: 2026

Venue: N/A
