4 papers from Anthropic on Red-Teaming & Adversarial Robustness
Privacy-preserving LLM insight systems like Anthropic's Clio can be tricked into leaking a user's medical history with just a single symptom and basic demographics, even with layered heuristic defenses.
Reasoning models are surprisingly bad at controlling their own thoughts: Claude Sonnet 4.5 succeeds in controlling its chain-of-thought only 2.7% of the time, raising questions about the reliability of CoT monitoring.
LLMs harbor surprisingly consistent hidden beliefs on sensitive topics like mass surveillance and torture, even when direct questioning suggests otherwise.
Coding agents are vulnerable to a new class of stealthy, automated prompt injection attacks via poisoned skills, achieving high success rates even in realistic software engineering tasks.