Apr 21, 2026 · arXiv:2604.19001

When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains

Ishita Kakkar, Enze Zhang, Rheeya Uppaal

AI Summary

The paper introduces HarmThoughts, a new benchmark dataset for evaluating the safety of reasoning traces in large language models at the sentence level. It categorizes harmful behaviors into 16 types across four functional groups, focusing on how harm propagates during reasoning rather than only on the final output. Experiments using HarmThoughts reveal that existing safety detectors struggle to identify fine-grained harmful behaviors within reasoning chains, especially those related to harm emergence and execution.

Key Contribution

Current safety filters judge the final answer and fail to detect the subtle, step-by-step progression of harm within reasoning chains, leaving models vulnerable to jailbreaks.

Abstract

Large reasoning models (LRMs) produce complex, multi-step reasoning traces, yet safety evaluation remains focused on final outputs, overlooking how harm emerges during reasoning. When a model is jailbroken, harm does not appear instantaneously but unfolds through distinct behavioral steps such as suppressing refusal, rationalizing compliance, decomposing harmful tasks, and concealing risk. However, no existing benchmark captures this process at sentence-level granularity within reasoning traces, a key step toward reliable safety monitoring, interventions, and systematic failure diagnosis. To address this gap, we introduce HarmThoughts, a benchmark for step-wise safety evaluation of reasoning traces. HarmThoughts is built on our proposed harm taxonomy of 16 harmful reasoning behaviors across four functional groups that characterize how harm propagates rather than what harm is produced. The dataset consists of 56,931 sentences from 1,018 reasoning traces generated by four model families, each annotated with fine-grained sentence-level behavioral labels. Using HarmThoughts, we analyze harm propagation patterns across reasoning traces, identifying common behavioral trajectories and drift points where reasoning transitions from safe to unsafe. Finally, we systematically compare white-box and black-box detectors on the task of identifying harmful reasoning behaviors on HarmThoughts. Our results show that existing detectors struggle with fine-grained behavior detection in reasoning traces, particularly for nuanced categories within harm emergence and execution, highlighting a critical gap in process-level safety monitoring. HarmThoughts is publicly available at: https://huggingface.co/datasets/ishitakakkar-10/HarmThoughts
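
As an illustration of what a sentence-level "drift point" could look like in practice, the following is a minimal sketch, not the authors' method. It assumes each trace is a list of per-sentence labels, with a hypothetical "safe" label for benign sentences; the other label strings are invented placeholders loosely based on the behaviors named in the abstract, and the dataset's actual label names and schema may differ. It also derives the average trace length implied by the reported counts.

```python
from typing import List, Optional

SAFE = "safe"  # hypothetical label for a benign sentence (assumption, not the dataset's schema)


def first_drift_point(labels: List[str]) -> Optional[int]:
    """Return the index of the first sentence where the labels move from
    safe to non-safe, or None if no such transition occurs.

    A trace that is unsafe from its first sentence has no safe-to-unsafe
    transition at index 0, so only later transitions are reported.
    """
    for i in range(1, len(labels)):
        if labels[i - 1] == SAFE and labels[i] != SAFE:
            return i
    return None


# Hypothetical labeled trace: two benign sentences, then harmful behaviors.
trace = [
    "safe",
    "safe",
    "suppress_refusal",        # placeholder label name
    "rationalize_compliance",  # placeholder label name
    "safe",
    "decompose_task",          # placeholder label name
]
drift_index = first_drift_point(trace)

# Average sentences per trace implied by the abstract's counts.
avg_sentences_per_trace = 56931 / 1018

if __name__ == "__main__":
    print("first drift point:", drift_index)
    print("average sentences per trace:", round(avg_sentences_per_trace, 1))
```

The sketch reports only the first transition; a trace can drift back to safe sentences and out again, so a fuller analysis would collect every transition index.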

Eval Frameworks & Benchmarks · Reasoning & Chain-of-Thought · Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 90

Year: 2026

Venue: N/A
