The paper introduces ProjLens, an interpretability framework for analyzing backdoor vulnerabilities in Multimodal Large Language Models (MLLMs), focusing on projector fine-tuning. The study reveals that even normal downstream task alignment via projector fine-tuning introduces backdoor vulnerabilities, with activation mechanisms distinct from those in text-only LLMs. Key findings include a low-rank subspace within the projector that encodes the backdoor-critical parameters, and a semantic shift of embeddings toward a shared direction aligned with the backdoor target, whose magnitude scales linearly with the input norm.
Projector fine-tuning, commonly used for aligning MLLMs, unexpectedly introduces backdoor vulnerabilities with activation mechanisms distinct from those in text-only LLMs.
Multimodal Large Language Models (MLLMs) have achieved remarkable success in cross-modal understanding and generation, yet their deployment is threatened by critical safety vulnerabilities. While prior works have demonstrated the feasibility of backdoors in MLLMs via fine-tuning data poisoning to manipulate inference, the underlying mechanisms of backdoor attacks remain opaque, complicating their understanding and mitigation. To bridge this gap, we propose ProjLens, an interpretability framework designed to demystify MLLM backdoors. We first establish that normal downstream task alignment, even when restricted to projector fine-tuning, introduces vulnerability to backdoor injection, whose activation mechanism differs from that observed in text-only LLMs. Through extensive experiments across four backdoor variants, we uncover: (1) Low-Rank Structure: backdoor injection updates appear full-rank overall and lack dedicated "trigger neurons", yet the backdoor-critical parameters are encoded within a low-rank subspace of the projector; (2) Activation Mechanism: both clean and poisoned embeddings undergo a semantic shift toward a shared direction aligned with the backdoor target, but the shift magnitude scales linearly with the input norm, yielding distinct backdoor activation on poisoned samples. Our code is available at: https://anonymous.4open.science/r/ProjLens-8FD7
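The low-rank finding lends itself to a simple SVD probe. The following is a minimal sketch, not the actual ProjLens implementation: it assumes hypothetical tensors w_clean and w_poisoned holding the projector weight matrix before and after poisoned fine-tuning, measures how much of the update's energy a rank-k subspace captures, and ablates that subspace to test whether the backdoor is encoded there.

import torch

def backdoor_subspace(w_clean: torch.Tensor, w_poisoned: torch.Tensor, k: int = 8):
    """Return the top-k singular subspace of the projector update and its energy share."""
    delta = w_poisoned - w_clean                      # injection update: delta W
    u, s, vh = torch.linalg.svd(delta, full_matrices=False)
    energy = (s[:k] ** 2).sum() / (s ** 2).sum()      # fraction of update energy in rank k
    return u[:, :k], s[:k], vh[:k], energy.item()

def ablate_subspace(w_poisoned: torch.Tensor, u_k, s_k, vh_k) -> torch.Tensor:
    """Remove the rank-k component of the update from the poisoned projector.
    If the backdoor-critical parameters live in this subspace, the edited
    projector should behave cleanly even though the full update looks high-rank."""
    low_rank_part = u_k @ torch.diag(s_k) @ vh_k
    return w_poisoned - low_rank_part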
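The activation-mechanism finding can likewise be checked numerically. The sketch below assumes hypothetical inputs: emb_clean_proj and emb_poison_proj are projector outputs for the same samples through the clean and backdoored projector, and target_dir is an assumed unit vector toward the backdoor target in embedding space. It measures how well the embedding shift aligns with that shared direction, and fits the shift magnitude against the input norm to test the linear-scaling claim.

import torch

def shift_alignment(emb_clean_proj: torch.Tensor,
                    emb_poison_proj: torch.Tensor,
                    target_dir: torch.Tensor) -> float:
    """Mean cosine similarity between the per-sample semantic shift and the
    assumed backdoor target direction (target_dir: unit vector of shape (d,))."""
    shift = emb_poison_proj - emb_clean_proj
    cos = torch.nn.functional.cosine_similarity(
        shift, target_dir.expand_as(shift), dim=-1)
    return cos.mean().item()

def norm_scaling(inputs: torch.Tensor, shifts: torch.Tensor):
    """Least-squares slope of ||shift|| against ||input|| (fit through the
    origin); a small residual supports the linear norm-scaling claim."""
    x = inputs.norm(dim=-1)
    y = shifts.norm(dim=-1)
    slope = (x * y).sum() / (x * x).sum()             # fit y ~ slope * x
    residual = (y - slope * x).pow(2).mean().sqrt()
    return slope.item(), residual.item()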