This paper investigates the deployment of safe large language models (LLMs) on resource-constrained edge devices, where the memory and computational demands of dual-model systems (an LLM paired with a guard model) are prohibitive. By systematically evaluating parameter-efficient safety alignment methods, the authors find that soft prompts trained with distillation yield the best safety-usefulness trade-offs. Their proposed distillation frameworks, based on total variation and KL divergence, transfer safety behaviors from guard models into learned soft prompts, establishing soft prompt distillation as the preferred approach for on-device LLM safety alignment.
Soft prompt distillation achieves better safety-usefulness trade-offs than LoRA adapters, steering vectors, and direct optimization, without the heavy resource burden of dual-model systems.
Deploying safe large language models (LLMs) on resource-constrained edge devices presents a critical challenge: while dual-model systems combining LLMs with guard models provide effective safety guarantees, their substantial memory and computational demands make them prohibitively expensive for on-device deployment. This paper presents a comprehensive study of parameter-efficient safety alignment methods for resource-constrained settings. Through systematic evaluation across multiple LLM architectures, training objectives, and parameter-efficient fine-tuning approaches, we identify that soft prompts combined with distillation-based training consistently outperform alternative methods. We introduce distillation frameworks based on total variation and KL divergence that effectively transfer safety behaviors from guard models into learned soft prompts. Our evaluations on various benchmarks demonstrate that this combination achieves superior safety-usefulness trade-offs compared to LoRA adapters, steering vectors, and direct optimization methods, while requiring minimal additional memory and compute at inference time. These findings establish soft prompt distillation as the preferred approach for safety alignment in on-device LLM deployment.
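The abstract names total variation and KL divergence as the distillation objectives but does not give the exact loss formulation. As a minimal sketch of the two divergences themselves, the snippet below compares a teacher distribution (standing in for a guard model's output) with a student distribution (standing in for the soft-prompted LLM). The two-class safe/unsafe setup, the probability values, and the function names are illustrative assumptions, not details from the paper.

```python
import math


def total_variation(p, q):
    """Total variation distance between two discrete distributions:
    0.5 * sum_i |p_i - q_i|."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum_i p_i * log(p_i / q_i).
    Terms with p_i == 0 contribute 0; q_i is floored at eps to avoid
    division by zero."""
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Hypothetical distributions over [safe, unsafe]; the values are made up.
teacher = [0.9, 0.1]  # assumed guard-model output
student = [0.6, 0.4]  # assumed output of the soft-prompted LLM

tv = total_variation(teacher, student)
kl = kl_divergence(teacher, student)
print(f"TV = {tv:.4f}, KL = {kl:.4f}")
```

In a distillation setup of this kind, one of these quantities would typically be minimized with respect to the soft prompt parameters while the base LLM stays frozen. Total variation is bounded in [0, 1] and symmetric, whereas KL divergence is unbounded and asymmetric, so the two can penalize a student's mistakes quite differently.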