This paper introduces JailbreakOPT, a tool-assisted framework for iterative single-turn jailbreak prompt optimization against large language models (LLMs). It organizes diverse atomic jailbreak prompts into an attack tool library and composes them through a unified intra-episode optimization abstraction. To reuse experience across attack episodes, it frames tool selection as a contextual bandit problem and uses contextual Thompson sampling to balance exploration and exploitation. Experiments across multiple target LLMs and attack goals show that JailbreakOPT achieves a higher attack success rate with fewer attempts than atomic single-turn attacks and existing iterative optimization baselines.
JailbreakOPT raises attack success rates against LLMs while reducing the number of attempts needed to bypass safety measures.
Jailbreak attacks expose persistent safety weaknesses in large language models (LLMs), but existing stateless single-turn methods face a trade-off: hand-crafted prompts are expressive but static, while iterative prompt optimization can adapt but often relies on low-level mutations that require many target queries. We propose JailbreakOPT, a tool-assisted framework for improving iterative single-turn jailbreak prompt optimization. JailbreakOPT organizes diverse atomic jailbreak prompts into an attack tool library and composes them through a unified intra-episode optimization abstraction to generate stronger standalone attack prompts. To reuse experience across attack episodes, JailbreakOPT further frames tool selection as a contextual bandit problem and applies contextual Thompson sampling to guide exploration and exploitation based on past outcomes. Experiments across multiple target LLMs and attack goals show that JailbreakOPT improves attack success rate (ASR) while reducing the number of attacks until success (No.A) compared with atomic single-turn attacks and existing iterative optimization baselines. This paper may contain offensive or harmful content.