ECNU, SJTU, Southampton | Jun 15, 2026 | arXiv:2606.16751

Automated jailbreak attack targeting multiple defense strategies

Qi Wang, Chengcheng Wan, Weijia He, Yanqing Li, Hanqi Sun, Xiaodong Gu, Jiangtao Wang

AI Summary

This paper introduces UNIATTACK, an adversarial testing framework that constructs black-box attack prompts against large language models (LLMs) by extracting minimal, high-impact attack features from existing adversarial strategies. A specialized attacker LLM optimizes these features, which are then composed into flexible templates that enable one-shot attacks across various models and defense strategies. On models deployed with multi-layered defenses, UNIATTACK improves the average attack success rate (ASR) by 64.63% to 248.82% over baseline methods while incurring only 0.03% to 4.96% of their cost.

Key Contribution

UNIATTACK raises the average attack success rate against LLMs with multi-layered defenses by up to 248.82% over baselines, at 0.03% to 4.96% of their cost.

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks. However, their safety remains a critical concern due to their susceptibility to adversarial prompt-based attacks. In this paper, we present UNIATTACK, an adversarial testing framework designed from a defense-oriented perspective to systematically construct effective black-box attack prompts. Unlike prior approaches that rely on static templates or iterative model-specific tuning, UNIATTACK extracts minimal but high-impact attack features from diverse existing attacks, optimizes them via a specialized attacker LLM, and composes them into flexible templates through an automated refinement process. This feature-centric construction enables one-shot attacks that generalize across multiple models and safety categories, providing a practical tool for assessing LLM robustness. Our evaluation results show that, compared to the baselines, UNIATTACK achieves an average attack success rate (ASR) improvement of 64.63%-248.82% on models deployed with multi-layered defense mechanisms, at only 0.03%-4.96% of the baselines' cost. The UNIATTACK artifact is available at https://anonymous.4open.science/r/UniAttack-Artifact-30F1.
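The abstract reports an ASR improvement and a cost figure, both relative to the baselines, without stating the formulas behind them. As a minimal sketch of how such figures are commonly computed, assuming ASR is the fraction of attempts judged successful, the improvement is a relative gain over the baseline ASR, and cost is expressed as a percentage of the baseline cost (all function names and inputs below are hypothetical and not taken from the paper):

```python
def attack_success_rate(outcomes):
    """ASR as the fraction of attempts judged successful.

    `outcomes` is a list of booleans, one per attempt (assumed format).
    """
    if not outcomes:
        raise ValueError("need at least one attempt")
    return sum(outcomes) / len(outcomes)


def relative_improvement_pct(new_asr, baseline_asr):
    """Relative ASR gain in percent: (new - baseline) / baseline * 100.

    Assumption: the reported improvement is relative, since a gain above
    100% cannot be an absolute difference between two rates.
    """
    if baseline_asr <= 0:
        raise ValueError("baseline ASR must be positive")
    return (new_asr - baseline_asr) / baseline_asr * 100.0


def cost_pct_of_baseline(new_cost, baseline_cost):
    """Cost of the new method as a percentage of the baseline cost."""
    if baseline_cost <= 0:
        raise ValueError("baseline cost must be positive")
    return new_cost * 100.0 / baseline_cost


if __name__ == "__main__":
    # Made-up outcomes for illustration only; not results from the paper.
    baseline = attack_success_rate([True, False, False, False])
    candidate = attack_success_rate([True, True, True, False])
    print(relative_improvement_pct(candidate, baseline))
    print(cost_pct_of_baseline(5, 100))
```

Under this reading, an improvement above 100% means the ASR more than doubled relative to the baseline, not that it exceeded 100% in absolute terms.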

Constitutional AI & AI Ethics | Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
