The paper introduces JailBreakLLM, an 8B parameter LLaMa-3.1 model fine-tuned using LoRA on a novel Jailbreak FineTune Dataset to generate jailbreaking prompts for GPT-4o. The Jailbreak FineTune Dataset consists of 204 paired entries of malicious intents and handcrafted obfuscated prompts. JailBreakLLM achieves an 85% one-shot attack success rate on a held-out test set, demonstrating the vulnerability of GPT-4o's safety mechanisms to targeted attacks from smaller, open-source models.
A LLaMa-3.1 model with just 8 billion parameters can jailbreak OpenAI's GPT-4o with 85% success, highlighting the persistent vulnerability of even advanced LLMs.
Large Language Models (LLMs) such as OpenAI's GPT-4o achieve state-of-the-art performance across many NLP tasks, yet remain vulnerable to “jailbreak” attacks that subvert their safety alignment. In this work, we introduce JailBreakLLM, an 8 billion-parameter Meta LLaMa-3.1 model fine-tuned via LoRA on a novel Jailbreak FineTune Dataset, specifically to generate one-shot prompts that successfully bypass GPT-4o's guardrails. Our dataset comprises 204 paired entries of high-level malicious intents (e.g., “how to create an explosive device”) and handcrafted, obfuscated, role-play-style prompts designed to exploit weaknesses in GPT-4o's content filters. We train with 4-bit quantization, a sequence length of 2048, LoRA rank 16, and standard PEFT hyperparameters, achieving a final loss of 0.77. On a held-out test set of 64 examples covering ten categories (including violence, illicit behavior, cybersecurity, fraud, hate speech, and public safety), JailBreakLLM attains an 85% one-shot attack success rate, with only 6% complete failures. We analyze category-wise performance, uncovering GPT-4o's greatest vulnerabilities in the direct-harm and illicit-behavior domains. Our results demonstrate that a comparatively small, open-source model can effectively dismantle the safety mechanisms of a leading commercial LLM.
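To give a sense of why LoRA at rank 16 makes fine-tuning an 8B model cheap, here is a back-of-envelope sketch of the adapter's trainable-parameter count. The dimensions are assumptions, not from the paper: hidden size 4096, 32 layers, grouped-query key/value projection width 1024 (matching the public LLaMa-3.1-8B configuration), and LoRA applied to the four attention projections. The paper does not state which modules were adapted, so treat the module list as hypothetical.

```python
# Hypothetical parameter-count sketch for a rank-16 LoRA adapter on a
# LLaMa-3.1-8B-scale transformer. All shapes below are assumptions
# based on the public model config, not figures from the paper.

HIDDEN = 4096    # model hidden size (assumed)
KV_WIDTH = 1024  # grouped-query key/value projection width (assumed)
LAYERS = 32      # number of transformer layers (assumed)
RANK = 16        # LoRA rank, as stated in the abstract


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA factorizes the weight update as B @ A, where A has shape
    (r, d_in) and B has shape (d_out, r); only A and B are trained."""
    return r * d_in + d_out * r


# Assumed adapted modules: the q/k/v/o attention projections per layer.
per_layer = (
    lora_params(HIDDEN, HIDDEN, RANK)       # q_proj
    + lora_params(HIDDEN, KV_WIDTH, RANK)   # k_proj
    + lora_params(HIDDEN, KV_WIDTH, RANK)   # v_proj
    + lora_params(HIDDEN, HIDDEN, RANK)     # o_proj
)
total = per_layer * LAYERS
print(f"{total:,} trainable LoRA parameters "
      f"({total / 8e9:.3%} of the 8B base weights)")
```

Under these assumptions the adapter trains well under 0.2% of the base model's weights, which is what makes combining LoRA with a 4-bit-quantized base model feasible on a single GPU.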