Mar 18, 2026

arXiv:2603.17372

Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift

Zhihua Wei, Jian Ruan, Zhenxin Qin, Leilei Wen, Dongrui Liu, Wen Shen

AI Summary

This paper investigates jailbreak vulnerabilities in vision-language models (VLMs) and finds that visual inputs shift internal representations toward a distinct "jailbreak state" even when the model recognizes harmful intent. The authors quantify this effect with a "jailbreak-related shift" (JRS) metric and show that it characterizes jailbreak behavior across different scenarios. Building on this analysis, they propose JRS-Rem, a defense that removes the jailbreak-related shift at inference time, providing strong defense while preserving performance on benign tasks.

Key Contribution

VLMs don't fail to *recognize* harmful intent when jailbroken; instead, visual inputs *shift* their internal representations into a distinct "jailbreak state," opening a new avenue for defense.

Abstract

Large vision-language models (VLMs) often exhibit weakened safety alignment with the integration of the visual modality. Even when text prompts contain explicit harmful intent, adding an image can substantially increase jailbreak success rates. In this paper, we observe that VLMs can clearly distinguish benign inputs from harmful ones in their representation space. Moreover, even among harmful inputs, jailbreak samples form a distinct internal state that is separable from refusal samples. These observations suggest that jailbreaks do not arise from a failure to recognize harmful intent. Instead, the visual modality shifts representations toward a specific jailbreak state, thereby leading to a failure to trigger refusal. To quantify this transition, we identify a jailbreak direction and define the jailbreak-related shift as the component of the image-induced representation shift along this direction. Our analysis shows that the jailbreak-related shift reliably characterizes jailbreak behavior, providing a unified explanation for diverse jailbreak scenarios. Finally, we propose a defense method that enhances VLM safety by removing the jailbreak-related shift (JRS-Rem) at inference time. Experiments show that JRS-Rem provides strong defense across multiple scenarios while preserving performance on benign tasks.
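The projection described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the jailbreak direction is identified, so the code below assumes a difference of class means between jailbreak and refusal representations, and it assumes the image-induced shift is the difference between the representation with the image and the representation of the text alone. All function names and the toy 2-D vectors are hypothetical.

```python
from math import sqrt


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def unit(v):
    """Scale a vector to unit length."""
    norm = sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("zero vector has no direction")
    return [x / norm for x in v]


def jailbreak_direction(jailbreak_reps, refusal_reps):
    # ASSUMPTION: direction = normalized difference of class means.
    # The abstract only says a jailbreak direction is "identified".
    mu_j = mean_vector(jailbreak_reps)
    mu_r = mean_vector(refusal_reps)
    return unit([a - b for a, b in zip(mu_j, mu_r)])


def jailbreak_related_shift(h_with_image, h_text_only, direction):
    # ASSUMPTION: image-induced shift = h(image + text) - h(text only).
    shift = [a - b for a, b in zip(h_with_image, h_text_only)]
    # JRS is the scalar component of that shift along the direction.
    return sum(s * d for s, d in zip(shift, direction))


def remove_jrs(h_with_image, h_text_only, direction):
    # Subtract only the component along the jailbreak direction,
    # leaving the rest of the image-induced shift untouched.
    jrs = jailbreak_related_shift(h_with_image, h_text_only, direction)
    return [h - jrs * d for h, d in zip(h_with_image, direction)]


# Toy 2-D example with made-up numbers.
direction = jailbreak_direction(
    jailbreak_reps=[[2.0, 0.0], [4.0, 0.0]],
    refusal_reps=[[-1.0, 0.0], [1.0, 0.0]],
)
h_text = [0.5, 1.0]
h_image = [2.5, 3.0]

jrs = jailbreak_related_shift(h_image, h_text, direction)
cleaned = remove_jrs(h_image, h_text, direction)
print(direction, jrs, cleaned)
```

In this toy case the direction is the first axis, so removal resets only the first coordinate to its text-only value and keeps the second coordinate of the image-induced shift. That mirrors the abstract's claim that removing the jailbreak-related component need not discard the rest of the visual information, though a real model would apply this to hidden states at layers the abstract does not specify.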

Constitutional AI & AI Ethics · Multimodal Models · Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
