AccentureD slices by indexing into the respectiveRutgersJan 12, 2026arXiv:2601.08000

Reasoning over Precedents Alongside Statutes: Case-Augmented Deliberative Alignment for LLM Safety

Can Jin, Rui Wu, Tong Che, Qixin Zhang, Hongwu Peng, Jiahui Zhao, Zhenting Wang, Wenqi Wei, Ligong Han, Zhao Zhang, Yuan Cao, Ruixiang Tang, Dimitris N. Metaxas

AI Summary

The paper investigates the effectiveness of deliberative alignment (DA) using explicit safety codes versus case-augmented examples for improving LLM safety. They find that explicit safety codes lead to inconsistent harmlessness and degraded helpfulness, while case-augmented simple codes result in more robust safety behaviors. Based on these findings, they propose CADA, a case-augmented deliberative alignment method using reinforcement learning on self-generated safety reasoning chains, which improves harmlessness, robustness, and utility.

Key Contribution

Ditch the rigid safety codes: case-augmented reasoning unlocks safer, more helpful LLMs that are also more robust to attacks.

Abstract

Ensuring that Large Language Models (LLMs) adhere to safety principles without refusing benign requests remains a significant challenge. While OpenAI introduces deliberative alignment (DA) to enhance the safety of its o-series models through reasoning over detailed ``code-like''safety rules, the effectiveness of this approach in open-source LLMs, which typically lack advanced reasoning capabilities, is understudied. In this work, we systematically evaluate the impact of explicitly specifying extensive safety codes versus demonstrating them through illustrative cases. We find that referencing explicit codes inconsistently improves harmlessness and systematically degrades helpfulness, whereas training on case-augmented simple codes yields more robust and generalized safety behaviors. By guiding LLMs with case-augmented reasoning instead of extensive code-like safety rules, we avoid rigid adherence to narrowly enumerated rules and enable broader adaptability. Building on these insights, we propose CADA, a case-augmented deliberative alignment method for LLMs utilizing reinforcement learning on self-generated safety reasoning chains. CADA effectively enhances harmlessness, improves robustness against attacks, and reduces over-refusal while preserving utility across diverse benchmarks, offering a practical alternative to rule-only DA for improving safety while maintaining helpfulness.

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness

Citation Metrics

Citations0

Influential citations0

References58

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

Reasoning over Precedents Alongside Statutes: Case-Augmented Deliberative Alignment for LLM Safety

Related Papers