Mar 12, 2026 · arXiv:2603.12206

CLASP: Defending Hybrid Large Language Models Against Hidden State Poisoning Attacks

A. Mercier, Alexandre Le Mercier, Thomas Demeester, Chris Develder

AI Summary

The paper introduces CLASP, a defense mechanism against Hidden State Poisoning Attacks (HiSPAs) targeting state space models (SSMs) like Mamba. CLASP frames HiSPA mitigation as a token-level binary classification problem, leveraging an XGBoost classifier to identify malicious tokens based on patterns in Mamba's block output embeddings (BOEs). Experiments on a resume screening task demonstrate that CLASP achieves high F1 scores (95.9% token-level, 99.3% document-level) and generalizes to unseen attack patterns, while maintaining low computational overhead (1,032 tokens/second, <4GB VRAM).

Key Contribution

Mamba's memory can be defended against adversarial attacks with a lightweight XGBoost classifier that spots malicious tokens based on their embedding patterns, achieving >90% F1 even on novel attack structures.

Abstract

State space models (SSMs) like Mamba have gained significant traction as efficient alternatives to Transformers, achieving linear complexity while maintaining competitive performance. However, Hidden State Poisoning Attacks (HiSPAs), a recently discovered vulnerability that corrupts SSM memory through adversarial strings, pose a critical threat to these architectures and their hybrid variants. Framing the HiSPA mitigation task as a binary classification problem at the token level, we introduce the CLASP model to defend against this threat. CLASP exploits distinct patterns in Mamba's block output embeddings (BOEs) and uses an XGBoost classifier to identify malicious tokens with minimal computational overhead. We consider a realistic scenario in which both SSMs and HiSPAs are likely to be used: an LLM screening résumés to identify the best candidates for a role. Evaluated on a corpus of 2,483 résumés totaling 9.5M tokens with controlled injections, CLASP achieves a 95.9% token-level F1 score and a 99.3% document-level F1 score on malicious-token detection. Crucially, the model generalizes to unseen attack patterns: under leave-one-out cross-validation, performance remains high (96.9% document-level F1), while under clustered cross-validation with structurally novel triggers, it maintains useful detection capability (91.6% average document-level F1). Operating independently of any downstream model, CLASP processes 1,032 tokens per second with under 4GB VRAM consumption, potentially making it suitable for real-world deployment as a lightweight front-line defense for SSM-based and hybrid architectures. All code and detailed results are available at https://anonymous.4open.science/r/hispikes-91C0.
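
The abstract reports F1 at two granularities (token-level and document-level). The sketch below illustrates one plausible way the two scores could relate, using only the Python standard library. It is not the authors' evaluation code: the function names, the toy labels, and in particular the rule that a document counts as malicious once at least `min_flagged` tokens are flagged are assumptions made for illustration, since the abstract does not state the aggregation rule.

```python
from typing import Dict, List, Tuple


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """F1 for the positive class (1 = malicious)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def document_labels(token_labels: Dict[str, List[int]],
                    min_flagged: int = 1) -> Dict[str, int]:
    """Assumed aggregation rule: a document is malicious if at least
    `min_flagged` of its tokens are labeled 1."""
    return {doc: int(sum(labels) >= min_flagged)
            for doc, labels in token_labels.items()}


def evaluate(true_tokens: Dict[str, List[int]],
             pred_tokens: Dict[str, List[int]],
             min_flagged: int = 1) -> Tuple[float, float]:
    """Return (token-level F1, document-level F1)."""
    docs = sorted(true_tokens)
    # Token level: pool every token of every document.
    flat_true = [t for d in docs for t in true_tokens[d]]
    flat_pred = [p for d in docs for p in pred_tokens[d]]
    token_f1 = f1_score(flat_true, flat_pred)
    # Document level: ground truth is "contains any injected token";
    # predictions use the (hypothetical) min_flagged threshold.
    doc_true = document_labels(true_tokens)
    doc_pred = document_labels(pred_tokens, min_flagged)
    doc_f1 = f1_score([doc_true[d] for d in docs],
                      [doc_pred[d] for d in docs])
    return token_f1, doc_f1


# Toy per-token labels for three hypothetical résumés (1 = injected token).
true_tokens = {"cv_a": [0, 0, 1, 1, 0], "cv_b": [0, 0, 0, 0], "cv_c": [1, 1, 1, 0]}
pred_tokens = {"cv_a": [0, 0, 1, 0, 0], "cv_b": [0, 0, 0, 0], "cv_c": [1, 1, 1, 1]}
token_f1, doc_f1 = evaluate(true_tokens, pred_tokens)
print(f"token-level F1 = {token_f1:.3f}, document-level F1 = {doc_f1:.3f}")
```

In this toy case the classifier misses one malicious token and raises one false alarm, yet every document is still classified correctly. Under the assumed aggregation rule, this is one way a document-level score can exceed the token-level score, as it does in the reported 99.3% versus 95.9%.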

Architecture Design (Transformers, SSMs, MoE); Natural Language Processing; Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 37

Year: 2026

Venue: N/A
