The Adversarial Humanities Benchmark (AHB) assesses the stylistic robustness of safety refusals in frontier language models by rephrasing harmful prompts with humanities-style transformations. Across 31 frontier models, attack success rate (ASR) rises from 3.84% on the original attacks to 55.75% overall on the transformed attacks. This exposes a critical vulnerability: current safety mechanisms generalize poorly across stylistic variations of harmful prompts, particularly in high-risk categories such as chemical, biological, radiological and nuclear (CBRN).
Frontier model safety crumbles when harmful prompts are rephrased with humanities-style transformations, revealing a profound lack of stylistic robustness.
The Adversarial Humanities Benchmark (AHB) evaluates whether model safety refusals survive a shift away from familiar harmful prompt forms. Starting from harmful tasks drawn from MLCommons AILuminate, the benchmark rewrites the same objectives through humanities-style transformations while preserving intent. This extends the literature on Adversarial Poetry and Adversarial Tales from single jailbreak operators to a broader benchmark family of stylistic obfuscation and goal concealment. In the benchmark results reported here, the original attacks record a 3.84% attack success rate (ASR), while the transformed methods range from 36.8% to 65.0%, yielding 55.75% overall ASR across 31 frontier models. Under a European Union AI Act Code-of-Practice-inspired systemic-risk lens, the chemical, biological, radiological and nuclear (CBRN) category is the highest bucket. Taken together, this lack of stylistic robustness suggests that current safety techniques suffer from weak generalization: a deep understanding of 'non-maleficence' remains a central unresolved problem in frontier model safety.
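As a rough illustration of how an ASR figure like those above could be tallied, the sketch below assumes ASR is the fraction of attack attempts judged successful, expressed as a percentage. The abstract does not spell out the scoring procedure, so this definition, the function names, and the toy records are all assumptions made for illustration. They are not benchmark data or the authors' evaluation code.

```python
from collections import defaultdict


def attack_success_rate(outcomes):
    """Return ASR as a percentage: successful attempts / total attempts.

    Assumption: each outcome is a boolean that is True when the model
    complied with the harmful request.
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no outcomes to score")
    return 100.0 * sum(1 for o in outcomes if o) / len(outcomes)


def asr_by_method(records):
    """Group (method, success) pairs and score each method separately."""
    grouped = defaultdict(list)
    for method, success in records:
        grouped[method].append(success)
    return {method: attack_success_rate(v) for method, v in grouped.items()}


# Hypothetical toy records, made up purely for illustration.
records = [
    ("original", False), ("original", False),
    ("original", False), ("original", True),
    ("transformed", True), ("transformed", True),
    ("transformed", False), ("transformed", True),
]

per_method = asr_by_method(records)
overall = attack_success_rate(success for _, success in records)
```

Under this definition, a per-method ASR and a pooled ASR can differ because the pooled figure weights each method by its number of attempts. Whether the reported 55.75% is pooled over attempts or averaged over models is not stated in the abstract.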