Shanghai Jiaotong University (SJTU) · May 26, 2026 · arXiv:2605.26574

GradSentry: Gradient Spectral Entropy for Backdoor Sample Filtering in Large Language Model Fine-Tuning

Haodong Zhao, Tianyi Xu, Tianhang Zhao, Zhuosheng Zhang, Gongshen Liu

AI Summary

This paper introduces GradSentry, a backdoor defense for LLM fine-tuning that filters poisoned samples based on the spectral entropy of their per-sample gradients. The core insight is that poisoned samples exhibit higher gradient spectral entropy than clean samples, which allows the two to be separated without clustering. The authors report that GradSentry works across poison ratios from 1% to 90%, on four QA datasets and four attack types, with an overhead of 20-50 ms per sample for a 7B model.

Key Contribution

Poisoned training data leaves a unique fingerprint in the spectral entropy of LLM gradients, enabling backdoor detection even at extreme poison ratios where clustering-based defenses fail.

Abstract

Fine-tuning Large Language Models with untrusted data exposes models to backdoor attacks, where poisoned samples cause targeted misbehavior. Existing sample-filtering defenses rely on clustering, which requires sufficient data and can fail at extreme poison ratios. We propose GradSentry (Gradient Sentry), a backdoor sample filtering method based on the spectral entropy of per-sample gradients. Our key finding is that poisoned samples produce gradients with higher spectral entropy than clean samples. GradSentry captures output-altering backdoor signatures using per-sample gradient spectra, avoiding pairwise sample comparisons and clustering during feature construction. Importantly, our method is training-agnostic: it works for both parameter-efficient fine-tuning methods like LoRA and full-parameter tuning, as the gradient analysis operates independently of which parameters are being updated during training. GradSentry requires no clustering, operates effectively across all poison ratios (1% to 90%), and introduces minimal computational overhead (20-50 ms per sample for a 7B model). Evaluation on four QA datasets and four attack types demonstrates the effectiveness of spectral entropy for backdoor detection. Code is available at https://github.com/dongdongzhaoUP/GradSentry.
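
The abstract does not spell out how spectral entropy is computed, so the following is only a minimal sketch of one common definition, not the authors' implementation. It assumes a per-sample gradient is available as a 2-D matrix, takes its singular values, normalizes their squares into a probability distribution, and returns the Shannon entropy of that distribution. The function names `spectral_entropy` and `flag_high_entropy`, the use of squared singular values, and the fixed `threshold` are all illustrative assumptions.

```python
import numpy as np


def spectral_entropy(grad, eps=1e-12):
    """Shannon entropy of the normalized squared singular values of `grad`.

    Assumption: this is one standard definition of spectral entropy; the
    paper may use a different normalization or spectrum.
    """
    # Singular values only; the singular vectors are not needed here.
    s = np.linalg.svd(np.asarray(grad, dtype=np.float64), compute_uv=False)
    energy = s ** 2
    total = energy.sum()
    if total <= eps:
        # An all-zero gradient has no spectrum to measure.
        return 0.0
    p = energy / total
    # Drop numerically zero components so log() stays finite.
    p = p[p > eps]
    return float(-(p * np.log(p)).sum())


def flag_high_entropy(scores, threshold):
    """Return indices whose score exceeds `threshold` (hypothetical rule)."""
    return [i for i, score in enumerate(scores) if score > threshold]
```

Under this definition a rank-1 gradient concentrates all energy in one singular value and scores close to zero, while a gradient whose energy is spread evenly over k singular values scores log(k). How the threshold is chosen in practice is not stated in the abstract and is left open here.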

Data Curation & Synthetic Data · Natural Language Processing · Red-Teaming & Adversarial Robustness · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 42

Year: 2026

Venue: N/A
