Search papers, labs, and topics across Lattice.
The paper analyzes the quality of AI safety datasets, revealing their over-reliance on "triggering cues" rather than genuine adversarial intent. They introduce "intent laundering," a technique to remove these cues while preserving malicious intent, and demonstrate that models deemed safe by existing datasets become vulnerable after applying this technique. The study highlights a critical gap between current safety evaluations and real-world adversarial scenarios, showing that models like Gemini 3 Pro and Claude Sonnet 3.7 are easily jailbroken once triggering cues are removed.
Stripping away obvious "triggering cues" from adversarial attacks reveals that current AI safety datasets drastically overestimate model robustness, turning "safe" models like Gemini 3 Pro and Claude Sonnet 3.7 into easy targets.
We systematically evaluate the quality of widely used AI safety datasets from two perspectives: in isolation and in practice. In isolation, we examine how well these datasets reflect real-world adversarial attacks based on three key properties: being driven by ulterior intent, well-crafted, and out-of-distribution. We find that these datasets overrely on"triggering cues": words or phrases with overt negative/sensitive connotations that are intended to trigger safety mechanisms explicitly, which is unrealistic compared to real-world attacks. In practice, we evaluate whether these datasets genuinely measure safety risks or merely provoke refusals through triggering cues. To explore this, we introduce"intent laundering": a procedure that abstracts away triggering cues from adversarial attacks (data points) while strictly preserving their malicious intent and all relevant details. Our results indicate that current AI safety datasets fail to faithfully represent real-world adversarial behavior due to their overreliance on triggering cues. Once these cues are removed, all previously evaluated"reasonably safe"models become unsafe, including Gemini 3 Pro and Claude Sonnet 3.7. Moreover, when intent laundering is adapted as a jailbreaking technique, it consistently achieves high attack success rates, ranging from 90% to over 98%, under fully black-box access. Overall, our findings expose a significant disconnect between how model safety is evaluated by existing datasets and how real-world adversaries behave.