The paper introduces JailBreakLLM, an 8B parameter LLaMa-3.1 model fine-tuned using LoRA on a novel Jailbreak FineTune Dataset to generate jailbreaking prompts for GPT-4o. The Jailbreak FineTune Dataset consists of 204 paired entries of malicious intents and handcrafted obfuscated prompts. JailBreakLLM achieves an 85% one-shot attack success rate on a held-out test set, demonstrating the vulnerability of GPT-4o's safety mechanisms to targeted attacks from smaller, open-source models.
A LLaMa-3.1 model with just 8 billion parameters can jailbreak OpenAI's GPT-4o with 85% success, highlighting the persistent vulnerability of even advanced LLMs.
Large Language Models (LLMs) such as OpenAI's GPT-4o achieve state-of-the-art performance across many NLP tasks, yet remain vulnerable to “jailbreak” attacks that subvert their safety alignment. In this work, we introduce JailBreakLLM, an 8 billion-parameter Meta LLaMa-3.1 model fine-tuned via LoRA on a novel Jailbreak FineTune Dataset, specifically to generate one-shot prompts that successfully bypass GPT-4o's guardrails. Our dataset comprises 204 paired entries of high-level malicious intents (e.g., “how to create an explosive device”) and handcrafted, obfuscated, role-play-style prompts designed to exploit weaknesses in GPT-4o's content filters. We train with 4-bit quantization, a sequence length of 2048, LoRA rank 16, and standard PEFT hyperparameters, achieving a final loss of 0.77. On a held-out test set of 64 examples covering ten categories (including violence, illicit behavior, cybersecurity, fraud, hate speech, and public safety), JailBreakLLM attains an 85% one-shot attack success rate, with only 6% complete failures. We analyze category-wise performance, uncovering GPT-4o's greatest vulnerabilities in the direct-harm and illicit-behavior domains. Our results demonstrate that a comparatively small, open-source model can effectively dismantle the safety mechanisms of a leading commercial LLM.
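To give a sense of why LoRA at rank 16 makes fine-tuning an 8B model cheap, here is a back-of-envelope sketch of the adapter's trainable-parameter count. The dimensions are assumptions, not from the paper: hidden size 4096, 32 layers, grouped-query key/value projection width 1024 (matching the public LLaMa-3.1-8B configuration), and LoRA applied to the four attention projections. The paper does not state which modules were adapted, so treat the module list as hypothetical.

```python
# Hypothetical parameter-count sketch for a rank-16 LoRA adapter on a
# LLaMa-3.1-8B-scale transformer. All shapes below are assumptions
# based on the public model config, not figures from the paper.

HIDDEN = 4096    # model hidden size (assumed)
KV_WIDTH = 1024  # grouped-query key/value projection width (assumed)
LAYERS = 32      # number of transformer layers (assumed)
RANK = 16        # LoRA rank, as stated in the abstract


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA factorizes the weight update as B @ A, where A has shape
    (r, d_in) and B has shape (d_out, r); only A and B are trained."""
    return r * d_in + d_out * r


# Assumed adapted modules: the q/k/v/o attention projections per layer.
per_layer = (
    lora_params(HIDDEN, HIDDEN, RANK)       # q_proj
    + lora_params(HIDDEN, KV_WIDTH, RANK)   # k_proj
    + lora_params(HIDDEN, KV_WIDTH, RANK)   # v_proj
    + lora_params(HIDDEN, HIDDEN, RANK)     # o_proj
)
total = per_layer * LAYERS
print(f"{total:,} trainable LoRA parameters "
      f"({total / 8e9:.3%} of the 8B base weights)")
```

Under these assumptions the adapter trains well under 0.2% of the base model's weights, which is what makes combining LoRA with a 4-bit-quantized base model feasible on a single GPU.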