The paper introduces ProjLens, an interpretability framework for analyzing backdoor vulnerabilities in Multimodal Large Language Models (MLLMs), focusing on projector fine-tuning. The study reveals that even normal downstream task alignment via projector fine-tuning introduces backdoor vulnerabilities, with activation mechanisms distinct from those in text-only LLMs. Key findings include a low-rank subspace within the projector that encodes the backdoor-critical parameters, and a semantic shift of embeddings toward a shared direction aligned with the backdoor target, whose magnitude scales linearly with the input norm.
Projector fine-tuning, commonly used for aligning MLLMs, unexpectedly introduces backdoor vulnerabilities with activation mechanisms distinct from those in text-only LLMs.
Multimodal Large Language Models (MLLMs) have achieved remarkable success in cross-modal understanding and generation, yet their deployment is threatened by critical safety vulnerabilities. While prior works have demonstrated the feasibility of backdoors in MLLMs via fine-tuning data poisoning to manipulate inference, the underlying mechanisms of backdoor attacks remain opaque, complicating their understanding and mitigation. To bridge this gap, we propose ProjLens, an interpretability framework designed to demystify MLLM backdoors. We first establish that normal downstream task alignment, even when restricted to projector fine-tuning, introduces vulnerability to backdoor injection, whose activation mechanism differs from that observed in text-only LLMs. Through extensive experiments across four backdoor variants, we uncover: (1) Low-Rank Structure: backdoor injection updates appear full-rank overall and lack dedicated "trigger neurons", yet the backdoor-critical parameters are encoded within a low-rank subspace of the projector; (2) Activation Mechanism: both clean and poisoned embeddings undergo a semantic shift toward a shared direction aligned with the backdoor target, but the shift magnitude scales linearly with the input norm, yielding distinct backdoor activation on poisoned samples. Our code is available at: https://anonymous.4open.science/r/ProjLens-8FD7
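The low-rank finding lends itself to a simple SVD probe. The following is a minimal sketch, not the actual ProjLens implementation: it assumes hypothetical tensors w_clean and w_poisoned holding the projector weight matrix before and after poisoned fine-tuning, measures how much of the update's energy a rank-k subspace captures, and ablates that subspace to test whether the backdoor is encoded there.

import torch

def backdoor_subspace(w_clean: torch.Tensor, w_poisoned: torch.Tensor, k: int = 8):
    """Return the top-k singular subspace of the projector update and its energy share."""
    delta = w_poisoned - w_clean                      # injection update: delta W
    u, s, vh = torch.linalg.svd(delta, full_matrices=False)
    energy = (s[:k] ** 2).sum() / (s ** 2).sum()      # fraction of update energy in rank k
    return u[:, :k], s[:k], vh[:k], energy.item()

def ablate_subspace(w_poisoned: torch.Tensor, u_k, s_k, vh_k) -> torch.Tensor:
    """Remove the rank-k component of the update from the poisoned projector.
    If the backdoor-critical parameters live in this subspace, the edited
    projector should behave cleanly even though the full update looks high-rank."""
    low_rank_part = u_k @ torch.diag(s_k) @ vh_k
    return w_poisoned - low_rank_part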
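The activation-mechanism finding can likewise be checked numerically. The sketch below assumes hypothetical inputs: emb_clean_proj and emb_poison_proj are projector outputs for the same samples through the clean and backdoored projector, and target_dir is an assumed unit vector toward the backdoor target in embedding space. It measures how well the embedding shift aligns with that shared direction, and fits the shift magnitude against the input norm to test the linear-scaling claim.

import torch

def shift_alignment(emb_clean_proj: torch.Tensor,
                    emb_poison_proj: torch.Tensor,
                    target_dir: torch.Tensor) -> float:
    """Mean cosine similarity between the per-sample semantic shift and the
    assumed backdoor target direction (target_dir: unit vector of shape (d,))."""
    shift = emb_poison_proj - emb_clean_proj
    cos = torch.nn.functional.cosine_similarity(
        shift, target_dir.expand_as(shift), dim=-1)
    return cos.mean().item()

def norm_scaling(inputs: torch.Tensor, shifts: torch.Tensor):
    """Least-squares slope of ||shift|| against ||input|| (fit through the
    origin); a small residual supports the linear norm-scaling claim."""
    x = inputs.norm(dim=-1)
    y = shifts.norm(dim=-1)
    slope = (x * y).sum() / (x * x).sum()             # fit y ~ slope * x
    residual = (y - slope * x).pow(2).mean().sqrt()
    return slope.item(), residual.item()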