Icaro Foundation, Sant'Anna School of Advanced Studies, Sapienza
Apr 20, 2026
arXiv:2604.18487

Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety

Marcello Galisai, Susanna Cifani, Francesco Giarrusso, Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Federico Sartore, Daniele Nardi

AI Summary

The Adversarial Humanities Benchmark (AHB) assesses the stylistic robustness of safety refusals in frontier language models by rephrasing harmful prompts through humanities-style transformations. Across 31 frontier models, attack success rate (ASR) rises from 3.84% on the original attacks to 55.75% overall on transformed attacks, with individual transformation methods ranging from 36.8% to 65.0%. This points to a critical vulnerability: current safety mechanisms generalize poorly across stylistic variations of harmful prompts, with the chemical, biological, radiological and nuclear (CBRN) category ranking highest.

Key Contribution

Frontier model safety crumbles when harmful prompts are rephrased with humanities-style transformations, revealing a profound lack of stylistic robustness.

Abstract

The Adversarial Humanities Benchmark (AHB) evaluates whether model safety refusals survive a shift away from familiar harmful prompt forms. Starting from harmful tasks drawn from MLCommons AILuminate, the benchmark rewrites the same objectives through humanities-style transformations while preserving intent. This extends the literature on Adversarial Poetry and Adversarial Tales from single jailbreak operators to a broader benchmark family of stylistic obfuscation and goal concealment. In the benchmark results reported here, the original attacks record a 3.84% attack success rate (ASR), while transformed methods range from 36.8% to 65.0%, yielding 55.75% overall ASR across 31 frontier models. Under a systemic-risk lens inspired by the European Union AI Act Code of Practice, the chemical, biological, radiological and nuclear (CBRN) bucket ranks highest. Taken together, this lack of stylistic robustness suggests that current safety techniques suffer from weak generalization: deep understanding of 'non-maleficence' remains a central unresolved problem in frontier model safety.
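
To make the reported metric concrete, the following is a minimal sketch of how an attack success rate can be computed per method and pooled across methods. It assumes the common definition of ASR as the fraction of attack attempts judged to have elicited a harmful response; the abstract does not describe the paper's judging or aggregation protocol, so the function names, the method labels, and the toy records below are all hypothetical and are not the benchmark's data.

```python
from collections import defaultdict


def attack_success_rate(outcomes):
    """Return the fraction of attempts judged successful.

    Assumption: each outcome is a bool, True when the model's response
    was judged harmful (the attack succeeded).
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no outcomes to score")
    return sum(outcomes) / len(outcomes)


def asr_by_method(records):
    """Group (method, success) pairs and compute the ASR for each method."""
    grouped = defaultdict(list)
    for method, success in records:
        grouped[method].append(success)
    return {method: attack_success_rate(vals) for method, vals in grouped.items()}


# Toy, made-up judgments; labels are placeholders, not AHB methods.
records = [
    ("original", False), ("original", False),
    ("original", False), ("original", True),
    ("method_a", True), ("method_a", False),
    ("method_b", True), ("method_b", True),
    ("method_b", True), ("method_b", False),
]

per_method = asr_by_method(records)

# Pooled ASR over all transformed attempts (everything except "original").
transformed = [success for method, success in records if method != "original"]
pooled = attack_success_rate(transformed)

print(per_method)
print(f"pooled transformed ASR: {pooled:.2%}")
```

Note that a pooled rate weights each method by its number of attempts, so it need not equal the unweighted mean of the per-method rates; which aggregation underlies the overall figure is not stated in the abstract.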

Topics: Constitutional AI & AI Ethics; Eval Frameworks & Benchmarks; Red-Teaming & Adversarial Robustness

