This paper introduces UNIATTACK, an adversarial testing framework that constructs effective black-box attack prompts against large language models (LLMs) by extracting and optimizing minimal yet high-impact attack features from existing adversarial strategies. Using a specialized attacker LLM, UNIATTACK composes these features into flexible templates that enable one-shot attacks and generalize across models and defense strategies. The results show an average attack success rate improvement of 64.63% to 248.82% over baseline methods, at only 0.03% to 4.96% of their cost.
UNIATTACK raises average attack success rates by up to 248.82% against LLMs with multi-layered defenses, at a small fraction of the baselines' cost.
Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks. However, their safety remains a critical concern due to their susceptibility to adversarial prompt-based attacks. In this paper, we present UNIATTACK, an adversarial testing framework designed from a defense-oriented perspective to systematically construct effective black-box attack prompts. Unlike prior approaches that rely on static templates or iterative model-specific tuning, UNIATTACK extracts minimal but high-impact attack features from diverse existing attacks, optimizes them via a specialized attacker LLM, and composes them into flexible templates through an automated refinement process. This feature-centric construction enables one-shot attacks that generalize across multiple models and safety categories, providing a practical tool for assessing LLM robustness. Our evaluation results show that, compared to the baselines, UNIATTACK achieves an average attack success rate (ASR) improvement of 64.63% to 248.82% on models deployed with multi-layered defense mechanisms, while incurring only 0.03% to 4.96% of the baselines' cost. The UNIATTACK artifact is available at https://anonymous.4open.science/r/UniAttack-Artifact-30F1.
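The abstract does not spell out how the ASR improvement and cost figures are computed. Because the reported improvements exceed 100%, one plausible reading is that they are relative to the baseline ASR, and that cost is expressed as a percentage of the baseline's cost. The sketch below illustrates that reading only. The function names, the use of query counts as the cost unit, and all numbers are hypothetical and are not taken from the paper or its artifact.

```python
def attack_success_rate(outcomes):
    """Fraction of evaluated prompts judged successful (True) by some evaluator."""
    if not outcomes:
        raise ValueError("need at least one evaluated prompt")
    return sum(1 for ok in outcomes if ok) / len(outcomes)


def relative_improvement_pct(new_asr, baseline_asr):
    """Relative ASR gain over a baseline, in percent (assumed definition)."""
    if baseline_asr <= 0:
        raise ValueError("baseline ASR must be positive")
    return (new_asr - baseline_asr) * 100.0 / baseline_asr


def cost_ratio_pct(new_cost, baseline_cost):
    """Cost of the new method as a percentage of the baseline's cost."""
    if baseline_cost <= 0:
        raise ValueError("baseline cost must be positive")
    return new_cost * 100.0 / baseline_cost


# Hypothetical evaluation outcomes: illustrative values, not results from the paper.
baseline_outcomes = [True] * 25 + [False] * 75   # 25 of 100 prompts succeed
new_outcomes = [True] * 50 + [False] * 50        # 50 of 100 prompts succeed

baseline_asr = attack_success_rate(baseline_outcomes)
new_asr = attack_success_rate(new_outcomes)

# Hypothetical cost unit: total queries sent to the target model.
improvement = relative_improvement_pct(new_asr, baseline_asr)
cost = cost_ratio_pct(new_cost=200, baseline_cost=10_000)

print(f"ASR improvement: {improvement:.2f}%")
print(f"Cost relative to baseline: {cost:.2f}%")
```

Under this reading, doubling the ASR counts as a 100% improvement, so a figure such as 248.82% would correspond to roughly 3.5 times the baseline's success rate.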