May 27, 2026

arXiv:2605.28030

SPARD: Defending Harmful Fine-Tuning Attack via Safety Projection with Relevance-Diversity Data Selection

Shuhao Chen, Weisen Jiang, Yeqi Gong, Shengda Luo, Chengxiang Zhuo, James T. Kwok, Yu Zhang

AI Summary

The paper introduces SPARD, a defense against harmful fine-tuning attacks that compromise LLM safety, by integrating Safety-Projected Alternating Gradient (SPAG) optimization with Relevance-Diversity Determinantal Point Process (RD-DPP) based safe data selection. SPAG alternates between utility updates and explicit safety projections using safe data, while RD-DPP curates a compact and effective safe dataset by balancing task relevance and safety coverage. Experiments show SPARD significantly reduces attack success rates compared to existing defenses, while preserving task accuracy on GSM8K and OpenBookQA.
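The alternation between utility updates and safety projections can be illustrated on a toy problem. The following is a minimal sketch under stated assumptions, not the paper's SPAG implementation: the quadratic utility loss, the ball-shaped "safe region", and all function names (`utility_grad`, `project_to_safe_ball`, `alternating_projected_descent`) are hypothetical stand-ins. In the paper the projection is computed from safe data, and its exact form is not given on this page.

```python
import numpy as np

def utility_grad(theta, target):
    # Gradient of the toy utility loss 0.5 * ||theta - target||^2 (an assumption).
    return theta - target

def project_to_safe_ball(theta, safe_center, radius):
    # Euclidean projection onto {theta : ||theta - safe_center|| <= radius}.
    # The ball is a stand-in for a safety constraint, not the paper's projection.
    offset = theta - safe_center
    dist = np.linalg.norm(offset)
    if dist <= radius:
        return theta
    return safe_center + offset * (radius / dist)

def alternating_projected_descent(theta0, target, safe_center, radius,
                                  lr=0.1, steps=200):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        # Utility update: one gradient step on the task loss.
        theta = theta - lr * utility_grad(theta, target)
        # Safety projection: pull the parameters back into the safe region.
        theta = project_to_safe_ball(theta, safe_center, radius)
    return theta

theta = alternating_projected_descent(
    theta0=[0.0, 0.0],
    target=np.array([3.0, 0.0]),       # task optimum lies outside the safe ball
    safe_center=np.array([0.0, 0.0]),
    radius=1.0,
)
```

In this toy setting the task optimum lies outside the safe region, so the iterates settle on the boundary point of the safe ball that is closest to the optimum: the utility step pushes outward and the projection step pulls back.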

Key Contribution

LLM safety can be significantly bolstered against harmful fine-tuning attacks by strategically projecting models back into safe parameter space using a relevance- and diversity-aware curated dataset.

Abstract

Fine-tuning large language models often undermines their safety alignment, a problem further amplified by harmful fine-tuning attacks in which adversarial data removes safeguards and induces unsafe behaviors. We propose SPARD, a defense framework that integrates Safety-Projected Alternating optimization with Relevance-Diversity aware data selection. SPARD employs SPAG, which alternates between utility updates and explicit safety projections, using a set of safe data to enforce safety constraints. To curate this safe data, we introduce a Relevance-Diversity Determinantal Point Process (RD-DPP) that selects a compact safe dataset, balancing task relevance and safety coverage. Experiments on GSM8K and OpenBookQA under four harmful fine-tuning attacks demonstrate that SPARD consistently achieves the lowest average attack success rates, substantially outperforming state-of-the-art defense methods, while maintaining high task accuracy. Code is available at https://github.com/shuhao02/SPARD.
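To make the relevance-diversity trade-off concrete, the sketch below uses a standard quality-diversity DPP kernel with naive greedy MAP selection. It is an illustration under assumptions, not the paper's RD-DPP: the kernel form L = diag(q) S diag(q), the cosine-similarity matrix S, the relevance scores q, and the function names `build_kernel` and `greedy_dpp_select` are all hypothetical choices made here.

```python
import numpy as np

def build_kernel(embeddings, relevance):
    # Quality-diversity kernel L = diag(q) S diag(q), with S the cosine
    # similarity between item embeddings (an assumed construction).
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarity = unit @ unit.T
    return relevance[:, None] * similarity * relevance[None, :]

def greedy_dpp_select(kernel, k):
    # Naive greedy MAP: repeatedly add the item that maximizes
    # log det of the kernel restricted to the selected set.
    selected = []
    remaining = list(range(kernel.shape[0]))
    for _ in range(k):
        best_item, best_logdet = None, -np.inf
        for i in remaining:
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best_item, best_logdet = i, logdet
        if best_item is None:  # no candidate keeps the determinant positive
            break
        selected.append(best_item)
        remaining.remove(best_item)
    return selected

# Toy data: item 1 is a near-duplicate of item 0; item 2 is orthogonal to it.
embeddings = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.7, 0.7]])
relevance = np.array([1.0, 0.95, 0.8, 0.5])

selected = greedy_dpp_select(build_kernel(embeddings, relevance), k=2)
```

On this toy data the selection is `[0, 2]`: item 1 has the second-highest relevance but is nearly identical to item 0, so the determinant penalizes it and the less relevant but more diverse item 2 is chosen instead.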

Constitutional AI & AI Ethics; Data Curation & Synthetic Data; Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 3

Influential citations: 1

References: 62

Year: 2026

Venue: N/A
