Anthropic

Eval Frameworks & Benchmarks

1 paper from Anthropic on Eval Frameworks & Benchmarks

Apr 16, 2026

Segment-Level Coherence for Robust Harmful Intent Probing in LLMs

LLM safety probes can be made significantly more robust to adversarial attacks by requiring consistent evidence across token segments, not just isolated spikes.

Xuanli He, Bilgehan Sel +8

Constitutional AI & AI Ethics, Eval Frameworks & Benchmarks, Red-Teaming & Adversarial Robustness
