GenerativeAI Academic Research Team (GART)
Silesian University of Technology
Wrocław University of Science and Technology
May 5, 2026

Comprehensive Analysis of LLM Guardrails Approaches Preventing Harmful Content and Jailbreak Attacks

Paweł Majewski, Kacper Marczak, Nina Dubicka, Jedrzej Podolak, Bartosz Sochaj, Jakub Siłka, Marek Kowal

AI Summary

This paper benchmarks six guardrail methods across three LLMs (Mistral Large, Llama 3, Claude 3.5 Sonnet) using 13 datasets to evaluate how well they prevent harmful content generation and resist jailbreak attacks. AWS Guardrails, a cloud-based solution, and NeMo, a non-cloud solution from Nvidia, achieve the highest accuracy, blocking harmful content while limiting excessive blocking of neutral prompts. The results indicate that commercial LLM applications need guardrails to mitigate the risk of jailbreak attacks.

Key Contribution

AWS Guardrails and NeMo stand out, achieving 96.8% and 93.9% accuracy respectively (averaged across the three models), which suggests that effective defenses against jailbreaks are within reach for commercial LLM deployments.

Abstract

Large Language Models (LLMs) have become increasingly important, with chatbots widely used in commercial settings to assist employees and answer customer questions. To protect a company’s reputation and ensure compliance, it’s crucial that chatbots do not generate harmful content, even in the face of deliberate jailbreak attacks. Researchers have proposed various methods to secure LLMs, known as guardrails, to prevent harmful content generation and jailbreak attacks. This article aims to comprehensively analyze existing guardrail solutions and provide guidelines for selecting the optimal solution for specific scenarios. The study compared six guardrail methods across three LLMs (Mistral Large 24.02, Meta Llama 3-8B Instruct, Anthropic Claude 3.5 Sonnet): two baseline approaches, two cloud-based solutions (AWS Guardrails, Azure AI Content Safety), and two other popular non-cloud solutions (NeMo by Nvidia and Llama Guard by Meta). Thirteen datasets were used for evaluation: ten representing harmful questions in jailbreak attacks and three with neutral prompts similar to harmful questions to check for excessive blocking. The best results were achieved by AWS Guardrails (averaged accuracy across models 96.8%) and NeMo (93.9%). The results clearly showed that using guardrails is essential when building commercial applications based on LLMs due to advancements in effective jailbreak attacks.
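
The evaluation described in the abstract scores each guardrail on both harmful prompts (which should be blocked) and neutral prompts (which should pass), then averages accuracy across models. The sketch below illustrates one way such a score could be computed. It is not the authors' evaluation code: the function names, the plain unweighted mean over models, and the toy decisions are all assumptions made for illustration, and the numbers are not results from the paper.

```python
from statistics import mean

# Each decision is a pair (is_harmful, was_blocked).
# A guardrail is "correct" when it blocks a harmful prompt
# or lets a neutral prompt through.

def guardrail_accuracy(decisions):
    """Fraction of prompts handled correctly (hypothetical helper)."""
    if not decisions:
        raise ValueError("no decisions to score")
    correct = sum(1 for is_harmful, was_blocked in decisions
                  if is_harmful == was_blocked)
    return correct / len(decisions)

def over_blocking_rate(decisions):
    """Fraction of neutral prompts that were blocked (hypothetical helper)."""
    neutral = [blocked for is_harmful, blocked in decisions if not is_harmful]
    if not neutral:
        raise ValueError("no neutral prompts to score")
    return sum(neutral) / len(neutral)

def averaged_accuracy(per_model):
    """Unweighted mean of per-model accuracy (assumed aggregation)."""
    return mean(guardrail_accuracy(d) for d in per_model.values())

# Toy data for illustration only, not taken from the paper.
toy = {
    "model_a": [(True, True), (True, True), (False, False), (False, True)],
    "model_b": [(True, True), (True, False), (False, False), (False, False)],
}

if __name__ == "__main__":
    for name, decisions in toy.items():
        print(name, guardrail_accuracy(decisions), over_blocking_rate(decisions))
    print("averaged", averaged_accuracy(toy))
```

Reporting the over-blocking rate alongside accuracy keeps the two failure modes separate: a guardrail that blocks everything would score well on the harmful datasets but fail every neutral prompt.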

Constitutional AI & AI Ethics; Eval Frameworks & Benchmarks; Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 45

Year: 2026

Venue: IEEE Access
