This paper introduces Hidden State Poisoning Attacks (HiSPAs), which exploit vulnerabilities in Mamba-based language models by overwriting information in their hidden states, causing a partial amnesia effect. The authors evaluate the impact of HiSPAs using the RoBench25 benchmark and show that SSMs, including a 52B hybrid Jamba model, are susceptible to these attacks while pure Transformers are not. They also show that HiSPA triggers weaken the Jamba model on the Open-Prompt-Injections benchmark, and they provide an interpretability analysis of Mamba's hidden layers during attacks.
Mamba's efficiency comes at a cost: carefully crafted input phrases can induce "amnesia" by irreversibly poisoning its hidden states, a vulnerability Transformers don't share.
State space models (SSMs) like Mamba offer efficient alternatives to Transformer-based language models, with linear time complexity. Yet their adversarial robustness remains largely unexplored. This paper studies the phenomenon whereby specific short input phrases induce a partial amnesia effect in such models by irreversibly overwriting information in their hidden states, which we refer to as a Hidden State Poisoning Attack (HiSPA). Our benchmark RoBench25 evaluates a model's information retrieval capabilities when subject to HiSPAs, and confirms the vulnerability of SSMs to such attacks. Even a recent 52B hybrid SSM-Transformer model from the Jamba family collapses on RoBench25 under optimized HiSPA triggers, whereas pure Transformers do not. We also observe that HiSPA triggers significantly weaken the Jamba model on the popular Open-Prompt-Injections benchmark, unlike pure Transformers. Finally, our interpretability study reveals patterns in Mamba's hidden layers during HiSPAs that could be used to build a HiSPA mitigation system. The full code and data to reproduce the experiments can be found at https://anonymous.4open.science/r/hispa_anonymous-5DB0.
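
To give some intuition for why a recurrent hidden state can be overwritten irreversibly, here is a minimal toy sketch in Python. It is not the authors' attack or code. It assumes a scalar recurrence in the style of a selective SSM, h_t = exp(-delta_t * a) * h_{t-1} + delta_t * x_t, and it assumes, purely for illustration, that a trigger token corresponds to one unusually large step size delta_t. The names `scan` and `retained_fraction` and all numeric values are hypothetical.

```python
import math


def scan(deltas, inputs, a=1.0, h0=0.0):
    """Run the toy recurrence h_t = exp(-delta_t * a) * h_{t-1} + delta_t * x_t.

    Returns the list of hidden states after each step.
    """
    h = h0
    states = []
    for delta, x in zip(deltas, inputs):
        decay = math.exp(-delta * a)  # fraction of the old state that is kept
        h = decay * h + delta * x
        states.append(h)
    return states


def retained_fraction(deltas, a=1.0):
    """Fraction of the initial state that survives all steps (inputs ignored)."""
    return math.exp(-a * sum(deltas))


# Hypothetical step sizes: small for ordinary tokens, one large "trigger" step.
benign = [0.01] * 5
poisoned = [0.01, 0.01, 20.0, 0.01, 0.01]

if __name__ == "__main__":
    print("benign retained:  ", retained_fraction(benign))
    print("poisoned retained:", retained_fraction(poisoned))
```

In this toy setting, most of the earlier state survives the benign sequence, while the single large step shrinks it to a negligible fraction, and no later token can restore it because the state is the only memory the recurrence has. An attention-based model can still read the earlier tokens directly, which is one plausible way to read the contrast with pure Transformers reported above.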