The paper introduces Visual Self-Fulfilling Alignment (VSFA), a novel method for aligning vision-language models (VLMs) by fine-tuning them on neutral Visual Question Answering (VQA) tasks constructed around threat-related images. This approach leverages the self-fulfilling mechanism to instill vigilance and caution in VLMs without requiring explicit safety labels. Experiments demonstrate that VSFA effectively reduces attack success rates, improves response quality, and mitigates over-refusal, while maintaining general capabilities across multiple VLMs and safety benchmarks.
Fine-tuning VLMs on neutral VQA tasks built around threat-related images can improve safety without any explicit safety labels, revealing a surprising visual pathway for alignment.
Multimodal large language models (MLLMs) face safety misalignment, where visual inputs enable harmful outputs. Existing methods address this with explicit safety labels or contrastive data. However, while threat-related concepts are concrete and visually depictable, safety concepts such as helpfulness are abstract and lack visual referents. Inspired by the self-fulfilling mechanism underlying emergent misalignment, we propose Visual Self-Fulfilling Alignment (VSFA). VSFA fine-tunes vision-language models (VLMs) on neutral VQA tasks constructed around threat-related images, without any safety labels. Through repeated exposure to threat-related visual content, models internalize the implicit semantics of vigilance and caution, shaping safety-oriented personas. Experiments across multiple VLMs and safety benchmarks demonstrate that VSFA reduces the attack success rate, improves response quality, and mitigates over-refusal while preserving general capabilities. Our work extends the self-fulfilling mechanism from text to visual modalities, offering a label-free approach to VLM alignment.
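To make the data setup concrete, here is a minimal sketch of how neutral VQA samples around threat-related images might be assembled. It is an illustration under stated assumptions, not the authors' pipeline: the question templates, the `VQASample` fields, the file paths, and the use of a caption as the answer are all hypothetical. The only property taken from the abstract is that each sample pairs a threat-related image with a neutral question and carries no safety label.

```python
from dataclasses import dataclass

# Hypothetical neutral templates: they ask only about visible content and
# never mention safety, harm, or refusal.
NEUTRAL_TEMPLATES = (
    "What objects are visible in this image?",
    "Describe the scene in one sentence.",
    "What is the main subject of this image?",
)


@dataclass(frozen=True)
class VQASample:
    image_path: str  # path to a threat-related image (assumed layout)
    question: str    # neutral question about the image
    answer: str      # plain descriptive answer; there is no safety-label field


def build_neutral_vqa(captions: dict[str, str]) -> list[VQASample]:
    """Pair each image with one neutral question, cycling through templates.

    `captions` maps an image path to a descriptive caption, used here as a
    placeholder answer (an assumption made for illustration).
    """
    samples = []
    for i, (image_path, caption) in enumerate(sorted(captions.items())):
        question = NEUTRAL_TEMPLATES[i % len(NEUTRAL_TEMPLATES)]
        samples.append(VQASample(image_path, question, caption))
    return samples
```

Calling `build_neutral_vqa` on a small mapping of image paths to captions yields one sample per image, in sorted path order. Such samples could then be passed to any standard supervised fine-tuning loop for a VLM; the point of the sketch is only that the supervision signal contains descriptive answers and nothing safety-specific.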