B, the proposed detector achieves over 97% detection accuracy with less than 2% false positives. This work demonstrates that backdoor behaviors leave identifiable spectral signatures in parameter-efficient adaptations, and that weight-space analysis provides a principled and practical alternative to execution-based defenses. More broadly, our results position geometric analysis of adapter weights as a promising direction for securing the emerging ecosystem of reusable PEFT components in large language models. Future work includes studying adaptive adversaries, eliminating the reference bank dependency, and validating across diverse architectures.
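To make "weight-space analysis" concrete, here is a minimal sketch of how spectral statistics can be read off a LoRA adapter without running the model. It assumes the standard LoRA parameterization, where the weight update is the product of two low-rank factors, ΔW = B·A. The function name `lora_spectral_features` and the three statistics it returns are illustrative assumptions, not the features, detector, or reference bank used in the paper.

```python
import numpy as np


def lora_spectral_features(A, B):
    """Spectral summary of a LoRA update delta_W = B @ A.

    A has shape (r, d_in) and B has shape (d_out, r), following the usual
    LoRA convention. The statistics are hypothetical examples chosen for
    illustration; they are not taken from the paper.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    rank = A.shape[0]

    # Singular values of the update; at most `rank` of them are nonzero.
    singular_values = np.linalg.svd(B @ A, compute_uv=False)[:rank]
    energy = singular_values ** 2
    total = energy.sum()

    if total == 0.0:
        # A zero update carries no spectral information.
        return {"top_energy_fraction": 0.0,
                "spectral_entropy": 0.0,
                "frobenius_norm": 0.0}

    p = energy / total          # normalized energy per singular direction
    nonzero = p[p > 0]          # avoid log(0)
    return {
        "top_energy_fraction": float(p[0]),
        "spectral_entropy": float(-(nonzero * np.log(nonzero)).sum()),
        "frobenius_norm": float(np.sqrt(total)),
    }
```

Statistics like these could be computed per adapted layer and compared against the same statistics from adapters known to be clean. How that comparison is performed in the paper is not reproduced here.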
Language models can be tricked into strategically tanking their performance with adversarially optimized prompts, revealing a major vulnerability in evaluation reliability.
Cutting LLMs' reasoning token budget can backfire spectacularly, tanking performance even below that of models with *no* reasoning at all.
Spot poisoned LoRA adapters without running them: a weight-space analysis achieves 97% accuracy in detecting backdoors, even when the trigger is unknown.