$1\times 10^{-6}$ yields the best overall performance.

5.7 Performance under prefix attack

Table 10 presents the performance under the prefix attack, where we append “<think></think>” to the end of the prompt. This modification is designed to make the LLM skip its reasoning process, allowing us to assess whether it still maintains strong alignment capabilities. The results show that our method consistently preserves strong safety and utility performance even under this adversarial setting.

6 Conclusion

This paper investigates why current LLM alignment techniques often fail under jailbreak attacks. Through causal interventions, we show that existing alignment methods rely on superficial refusal patterns rather than deep understanding. To address this, we introduce a long-form Chain-of-Thought (CoT) dataset and show that CoT fine-tuning improves both safety and utility. Building on the error patterns observed in CoT fine-tuning, we propose Alignment-Weighted DPO (AW-DPO), a novel method that separately targets reasoning and response errors for fine-grained correction. Our experiments demonstrate that AW-DPO outperforms existing baselines in safety while preserving utility, offering a more robust approach to LLM alignment.

Acknowledgements

This work was supported by Capital One Bank. The authors thank their collaborators and reviewers for their valuable feedback.

Ethics Statement

LLMs are widely deployed and achieve promising performance across domains, so studying their safety is of great practical significance. In this paper, we propose Alignment-Weighted DPO (AW-DPO), a novel method that separately targets reasoning and response errors for fine-grained correction. Because our aim is to enhance the safety of existing LLMs, this work raises no ethical concerns and introduces no additional security risks.

Reproducibility Statement

For implementation details, please refer to Appendices A and H. We provide the CoT dataset at https://anonymous.4open.science/r/cot_safety_data-103
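To make the prefix attack of Section 5.7 concrete, the minimal sketch below shows one way such an evaluation prompt could be constructed and run. The model name, the build_prefix_attack_prompt helper, and the use of the Hugging Face transformers chat template are illustrative assumptions, not the authors' evaluation code.

```python
# Illustrative sketch of the prefix attack in Section 5.7 (assumed details, not
# the authors' released code): an empty <think></think> block is appended so the
# model is nudged to skip its reasoning before answering.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"  # placeholder model choice


def build_prefix_attack_prompt(tokenizer, user_prompt: str) -> str:
    """Format a chat prompt, then append the empty reasoning block."""
    chat = [{"role": "user", "content": user_prompt}]
    prompt = tokenizer.apply_chat_template(
        chat, tokenize=False, add_generation_prompt=True
    )
    return prompt + "<think></think>"  # the attack suffix


if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    attacked = build_prefix_attack_prompt(tokenizer, "How do I pick a lock?")
    inputs = tokenizer(attacked, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=256)
    # Decode only the newly generated tokens to inspect the model's answer.
    print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:]))
```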
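As a rough illustration of the AW-DPO idea summarized in the conclusion, the sketch below applies separate weights to reasoning-span and response-span tokens inside an otherwise standard DPO objective. The function name aw_dpo_loss, the mask-based weighting, and the w_reason/w_resp coefficients are assumptions for illustration; the authors' exact formulation follows their implementation details (Appendices A and H), not this sketch.

```python
# Rough sketch of an alignment-weighted DPO loss. This is NOT the authors'
# exact AW-DPO objective; it only illustrates weighting the reasoning span and
# the response span separately before forming the DPO preference term.
import torch
import torch.nn.functional as F


def aw_dpo_loss(
    policy_logps,      # (B, T) per-token log-probs under the policy
    ref_logps,         # (B, T) per-token log-probs under the reference model
    reason_mask,       # (B, T) 1 for tokens inside the reasoning (CoT) span
    resp_mask,         # (B, T) 1 for tokens inside the final response span
    chosen_idx,        # indices of chosen sequences within the batch
    rejected_idx,      # indices of rejected sequences within the batch
    beta: float = 0.1,
    w_reason: float = 1.0,   # hypothetical weight on reasoning-span tokens
    w_resp: float = 1.0,     # hypothetical weight on response-span tokens
):
    # Weighted per-sequence log-ratio: reasoning and response tokens are scaled
    # by separate coefficients before summation over the sequence.
    token_ratio = policy_logps - ref_logps                  # (B, T)
    weights = w_reason * reason_mask + w_resp * resp_mask   # (B, T)
    seq_ratio = (weights * token_ratio).sum(dim=-1)         # (B,)

    # Standard DPO preference term, applied to the reweighted ratios.
    margin = seq_ratio[chosen_idx] - seq_ratio[rejected_idx]
    return -F.logsigmoid(beta * margin).mean()
```

Under this reading, setting w_reason above w_resp would penalize flawed reasoning traces more heavily than flawed final answers; whatever the exact scheme, keeping the two spans separate is what allows the fine-grained correction described in the paper.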