Search papers, labs, and topics across Lattice.
This paper empirically investigates the impact of intrinsic model characteristics and external attack techniques on the safety alignment of 32 LLMs and LRMs (3B-235B parameters) across 13 model families. The study uses 5 safety datasets, 56 jailbreak techniques, and 4 Chain-of-Thought (CoT) attack strategies, finding that models with integrated reasoning and self-reflection (GPT-OSS-20B, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B) exhibit the best safety alignment. The research also demonstrates that post-training and knowledge distillation can degrade safety alignment, and that CoT attacks using response prefixes significantly increase attack success rates, especially in text-completion interfaces.
User-defined response prefixes in LLMs are a major safety risk, enabling CoT attacks to achieve near-perfect success rates on some models.
This paper presents a comprehensive empirical study on the safety alignment capabilities. We evaluate what matters for safety alignment in LLMs and LRMs to provide essential insights for developing more secure and reliable AI systems. We systematically investigate and compare the influence of six critical intrinsic model characteristics and three external attack techniques. Our large-scale evaluation is conducted using 32 recent, popular LLMs and LRMs across thirteen distinct model families, spanning a parameter scale from 3B to 235B. The assessment leverages five established safety datasets and probes model vulnerabilities with 56 jailbreak techniques and four CoT attack strategies, resulting in 4.6M API calls. Our key empirical findings are fourfold. First, we identify the LRMs GPT-OSS-20B, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B as the top-three safest models, which substantiates the significant advantage of integrated reasoning and self-reflection mechanisms for robust safety alignment. Second, post-training and knowledge distillation may lead to a systematic degradation of safety alignment. We thus argue that safety must be treated as an explicit constraint or a core optimization objective during these stages, not merely subordinated to the pursuit of general capability. Third, we reveal a pronounced vulnerability: employing a CoT attack via a response prefix can elevate the attack success rate by 3.34x on average and from 0.6% to 96.3% for Seed-OSS-36B-Instruct. This critical finding underscores the safety risks inherent in text-completion interfaces and features that allow user-defined response prefixes in LLM services, highlighting an urgent need for architectural and deployment safeguards. Fourth, roleplay, prompt injection, and gradient-based search for adversarial prompts are the predominant methodologies for eliciting unaligned behaviors in modern models.