This paper investigates the vulnerability of open-source LLMs to adversarial attacks built from special characters, spanning Unicode, homoglyph, structural, and textual encoding manipulations. The study evaluated seven models (3.8B-32B parameters) against a suite of 4,000+ attacks, revealing significant weaknesses in safety mechanisms at every tested model size. The findings show that character-level adversarial perturbations can induce jailbreaks, incoherent outputs, and hallucinations in these models.
LLMs, even those with tens of billions of parameters, are surprisingly susceptible to jailbreaking and incoherent output generation through simple special-character adversarial attacks.
Large language models (LLMs) have achieved remarkable performance across diverse natural language processing tasks, yet their vulnerability to character-level adversarial manipulation presents significant security challenges for real-world deployments. This paper presents a study of special character attacks, including Unicode, homoglyph, structural, and textual encoding attacks, aimed at bypassing safety mechanisms. We evaluate seven prominent open-source models ranging from 3.8B to 32B parameters on 4,000+ attack attempts. These experiments reveal critical vulnerabilities across all model sizes, exposing failure modes that include successful jailbreaks, incoherent outputs, and unrelated hallucinations.
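To make the attack categories concrete, the following is a minimal illustrative sketch (not code from the paper) of a homoglyph perturbation: visually identical Unicode lookalikes replace Latin characters, so the prompt looks unchanged to a human reader while producing a different byte sequence, and hence different tokens, for the model. The character map and function name here are hypothetical.

```python
# Hypothetical homoglyph substitution map: Latin letters -> Cyrillic
# lookalikes that render nearly identically in most fonts.
HOMOGLYPHS = {
    "a": "\u0430",  # Cyrillic small a
    "e": "\u0435",  # Cyrillic small ie
    "o": "\u043e",  # Cyrillic small o
    "p": "\u0440",  # Cyrillic small er
    "c": "\u0441",  # Cyrillic small es
}

def homoglyph_perturb(prompt: str) -> str:
    """Replace each mapped character with its Unicode lookalike."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in prompt)

original = "open source"
perturbed = homoglyph_perturb(original)
print(perturbed)               # renders like "open source" on screen
print(perturbed == original)   # but the underlying strings differ: False
```

Structural and textual encoding attacks follow the same principle with different transformations (e.g. inserting zero-width characters or re-encoding substrings) rather than character-for-character substitution.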