Mar 3, 2026 · arXiv:2603.03081

TAO-Attack: Toward Advanced Optimization-Based Jailbreak Attacks for Large Language Models

Zhi Xu, Jiaqi Li, Xiaotong Zhang, Hong Yu, Han Liu

AI Summary

TAO-Attack is an optimization-based jailbreak method for LLMs that targets three limitations of existing approaches: frequent refusals, pseudo-harmful outputs, and inefficient token-level updates. It uses a two-stage loss function: the first stage suppresses refusals, and the second penalizes pseudo-harmful outputs to steer the model toward genuinely harmful completions. The method also incorporates a direction-priority token optimization (DPTO) strategy, which improves efficiency by aligning candidate tokens with the gradient direction before considering update magnitude.

Key Contribution

By prioritizing gradient direction in token optimization and using a two-stage loss, TAO-Attack achieves higher attack success rates than state-of-the-art methods across multiple LLMs, reaching 100% in certain scenarios and exposing weaknesses in current safety alignment.

Abstract

Large language models (LLMs) have achieved remarkable success across diverse applications but remain vulnerable to jailbreak attacks, where attackers craft prompts that bypass safety alignment and elicit unsafe responses. Among existing approaches, optimization-based attacks have shown strong effectiveness, yet current methods often suffer from frequent refusals, pseudo-harmful outputs, and inefficient token-level updates. In this work, we propose TAO-Attack, a new optimization-based jailbreak method. TAO-Attack employs a two-stage loss function: the first stage suppresses refusals to ensure the model continues harmful prefixes, while the second stage penalizes pseudo-harmful outputs and encourages the model toward more harmful completions. In addition, we design a direction-priority token optimization (DPTO) strategy that improves efficiency by aligning candidates with the gradient direction before considering update magnitude. Extensive experiments on multiple LLMs demonstrate that TAO-Attack consistently outperforms state-of-the-art methods, achieving higher attack success rates and even reaching 100% in certain scenarios.

Constitutional AI & AI Ethics · Natural Language Processing · Red-Teaming & Adversarial Robustness · RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 37

Year: 2026

Venue: N/A
