The paper introduces SPARD, a defense against harmful fine-tuning attacks that compromise LLM safety. SPARD integrates Safety-Projected Alternating Gradient (SPAG) optimization with safe data selection based on a Relevance-Diversity Determinantal Point Process (RD-DPP). SPAG alternates between utility updates and explicit safety projections using safe data, while RD-DPP curates a compact and effective safe dataset by balancing task relevance and safety coverage. Under four harmful fine-tuning attacks, SPARD achieves the lowest average attack success rates among the compared defenses while preserving task accuracy on GSM8K and OpenBookQA.
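The alternation between utility updates and safety projections can be illustrated with a toy sketch. This is not the authors' implementation: the function name `spag_sketch`, the threshold `tau`, the linearized projection step, and the quadratic toy losses are all assumptions chosen for illustration. The summary only states that SPAG alternates utility updates with explicit safety projections computed on safe data.

```python
import numpy as np

def spag_sketch(w, utility_grad, safety_loss, safety_grad,
                tau, lr=0.1, steps=100, proj_iters=20):
    """Alternate a utility step with an approximate projection onto
    {w : safety_loss(w) <= tau}. Hypothetical helper, not the paper's code."""
    w = np.asarray(w, dtype=float).copy()
    for _ in range(steps):
        # Utility update: one gradient step on the task loss.
        w -= lr * utility_grad(w)
        # Safety projection (assumed form): linearized correction steps
        # along the safety gradient until the constraint is satisfied.
        for _ in range(proj_iters):
            violation = safety_loss(w) - tau
            if violation <= 0:
                break
            g = safety_grad(w)
            w -= violation / (g @ g + 1e-12) * g
    return w

# Toy problem (assumed): the task optimum u lies outside the safe region,
# which is a ball of radius 1 around the origin.
u = np.array([3.0, 0.0])
utility_loss = lambda w: 0.5 * np.sum((w - u) ** 2)
utility_grad = lambda w: w - u
safety_loss = lambda w: 0.5 * np.sum(w ** 2)
safety_grad = lambda w: w
tau = 0.5

w0 = np.zeros(2)
w_final = spag_sketch(w0, utility_grad, safety_loss, safety_grad, tau)
```

In this toy setting the iterate moves toward the task optimum until it reaches the boundary of the safe region, and each projection pulls it back inside. With a real model, both gradients would come from mini-batches of task data and safe data respectively.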
LLM safety can be significantly bolstered against harmful fine-tuning attacks by strategically projecting models back into safe parameter space using a relevance- and diversity-aware curated dataset.
Fine-tuning large language models often undermines their safety alignment, a problem further amplified by harmful fine-tuning attacks in which adversarial data removes safeguards and induces unsafe behaviors. We propose SPARD, a defense framework that integrates Safety-Projected Alternating optimization with Relevance-Diversity-aware data selection. SPARD employs Safety-Projected Alternating Gradient (SPAG), which alternates between utility updates and explicit safety projections on a set of safe data to enforce safety constraints. To curate safe data, we introduce a Relevance-Diversity Determinantal Point Process (RD-DPP) that selects a compact safe dataset, balancing task relevance and safety coverage. Experiments on GSM8K and OpenBookQA under four harmful fine-tuning attacks demonstrate that SPARD consistently achieves the lowest average attack success rates, substantially outperforming state-of-the-art defense methods, while maintaining high task accuracy. Code is available at https://github.com/shuhao02/SPARD.
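The relevance-diversity trade-off in the data selection step can be sketched with a greedy DPP selection. The abstract does not specify the kernel, so this sketch assumes the quality-diversity decomposition common in the DPP literature, L = diag(r) S diag(r), with r a relevance score per safe sample and S the cosine similarity between safe samples. The function name `rd_dpp_select`, the exponential relevance score, and the toy embeddings are hypothetical.

```python
import numpy as np

def rd_dpp_select(safe_emb, task_emb, k, jitter=1e-6):
    """Greedily pick k safe samples that maximize log det of the DPP
    kernel submatrix. Hypothetical helper, not the paper's code."""
    safe = safe_emb / np.linalg.norm(safe_emb, axis=1, keepdims=True)
    task = task_emb / np.linalg.norm(task_emb, axis=1, keepdims=True)
    # Relevance (assumed form): mean cosine similarity to the task
    # samples, passed through exp so that scores stay positive.
    rel = np.exp((safe @ task.T).mean(axis=1))
    # Diversity: cosine similarity among the safe samples.
    sim = safe @ safe.T
    n = len(safe)
    L = rel[:, None] * sim * rel[None, :] + jitter * np.eye(n)

    selected = []
    for _ in range(min(k, n)):
        best, best_val = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_val:
                best, best_val = i, logdet
        if best is None:
            break
        selected.append(best)
    return selected

# Toy embeddings (assumed): samples 0 and 1 are duplicates that are
# highly relevant to the task; sample 2 is less relevant but different.
safe_emb = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
task_emb = np.array([[1.0, 0.2]])
selected = rd_dpp_select(safe_emb, task_emb, k=2)
```

Because the determinant shrinks toward zero when two selected samples are near-duplicates, the greedy step avoids picking both copies of the highly relevant sample and adds the dissimilar one instead. The exhaustive log-determinant evaluation is kept for clarity and would need an incremental update to scale to large candidate pools.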