KAUST | MBZUAI | Provable Responsible AI and Data | Mar 9, 2026 | arXiv:2603.08486

Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images

AI Summary

The paper introduces Visual Self-Fulfilling Alignment (VSFA), a novel method for aligning vision-language models (VLMs) by fine-tuning them on neutral Visual Question Answering (VQA) tasks constructed around threat-related images. This approach leverages the self-fulfilling mechanism to instill vigilance and caution in VLMs without requiring explicit safety labels. Experiments demonstrate that VSFA effectively reduces attack success rates, improves response quality, and mitigates over-refusal, while maintaining general capabilities across multiple VLMs and safety benchmarks.

Key Contribution

Fine-tuning VLMs on threat-related images alone can significantly improve safety without any explicit safety labels, revealing a surprising visual pathway for alignment.

Abstract

Multimodal large language models (MLLMs) face safety misalignment, where visual inputs enable harmful outputs. Existing methods address this with explicit safety labels or contrastive data. Yet threat-related concepts are concrete and visually depictable, whereas safety concepts such as helpfulness are abstract and lack visual referents. Inspired by the self-fulfilling mechanism underlying emergent misalignment, we propose Visual Self-Fulfilling Alignment (VSFA). VSFA fine-tunes vision-language models (VLMs) on neutral VQA tasks constructed around threat-related images, without any safety labels. Through repeated exposure to threat-related visual content, models internalize the implicit semantics of vigilance and caution, shaping safety-oriented personas. Experiments across multiple VLMs and safety benchmarks demonstrate that VSFA reduces the attack success rate, improves response quality, and mitigates over-refusal while preserving general capabilities. Our work extends the self-fulfilling mechanism from text to visual modalities, offering a label-free approach to VLM alignment.
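The two headline metrics named in the abstract, attack success rate and over-refusal, can be illustrated with a minimal sketch. It assumes the common definitions (the share of attack prompts that yield a harmful response, and the share of benign prompts that are refused); the paper's exact definitions, judges, and benchmarks are not given on this page, and the `JudgedResponse` record and its fields are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class JudgedResponse:
    """One model response with verdicts from some external judge (hypothetical schema)."""
    is_attack_prompt: bool  # True for a harmful/jailbreak prompt, False for a benign one
    is_harmful: bool        # assumed judge verdict: the response complied harmfully
    is_refusal: bool        # assumed judge verdict: the response refused to answer


def attack_success_rate(results: Iterable[JudgedResponse]) -> float:
    """Fraction of attack prompts whose response was judged harmful (assumed definition)."""
    attacks: List[JudgedResponse] = [r for r in results if r.is_attack_prompt]
    if not attacks:
        raise ValueError("no attack prompts in results")
    return sum(r.is_harmful for r in attacks) / len(attacks)


def over_refusal_rate(results: Iterable[JudgedResponse]) -> float:
    """Fraction of benign prompts whose response was judged a refusal (assumed definition)."""
    benign: List[JudgedResponse] = [r for r in results if not r.is_attack_prompt]
    if not benign:
        raise ValueError("no benign prompts in results")
    return sum(r.is_refusal for r in benign) / len(benign)


# Toy verdicts, made up for illustration only; not data from the paper.
toy_results = [
    JudgedResponse(is_attack_prompt=True, is_harmful=True, is_refusal=False),
    JudgedResponse(is_attack_prompt=True, is_harmful=False, is_refusal=True),
    JudgedResponse(is_attack_prompt=False, is_harmful=False, is_refusal=True),
    JudgedResponse(is_attack_prompt=False, is_harmful=False, is_refusal=False),
]
```

Under these assumed definitions, lower is better for both numbers: a method that lowers attack success rate only by refusing more benign prompts would show up as a higher over-refusal rate.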

Constitutional AI & AI Ethics | Multimodal Models | Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
