This paper investigates the vulnerability of open-source LLMs to adversarial attacks built from special characters, spanning Unicode, homoglyph, structural, and textual encoding manipulations. The study evaluated seven models (3.8B-32B parameters) against a suite of 4,000+ attacks, revealing significant weaknesses in safety mechanisms at every tested model size. The findings show that character-level adversarial perturbations can induce jailbreaks, incoherent outputs, and hallucinations in these models.
LLMs, even those with tens of billions of parameters, are surprisingly susceptible to jailbreaking and incoherent output generation through simple special-character adversarial attacks.
Large language models (LLMs) have achieved remarkable performance across diverse natural language processing tasks, yet their vulnerability to character-level adversarial manipulation presents significant security challenges for real-world deployments. This paper presents a study of special character attacks, including Unicode, homoglyph, structural, and textual encoding attacks, aimed at bypassing safety mechanisms. We evaluate seven prominent open-source models ranging from 3.8B to 32B parameters on 4,000+ attack attempts. These experiments reveal critical vulnerabilities across all model sizes, exposing failure modes that include successful jailbreaks, incoherent outputs, and unrelated hallucinations.
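To make the attack categories concrete, the following is a minimal illustrative sketch (not code from the paper) of a homoglyph perturbation: visually identical Unicode lookalikes replace Latin characters, so the prompt looks unchanged to a human reader while producing a different byte sequence, and hence different tokens, for the model. The character map and function name here are hypothetical.

```python
# Hypothetical homoglyph substitution map: Latin letters -> Cyrillic
# lookalikes that render nearly identically in most fonts.
HOMOGLYPHS = {
    "a": "\u0430",  # Cyrillic small a
    "e": "\u0435",  # Cyrillic small ie
    "o": "\u043e",  # Cyrillic small o
    "p": "\u0440",  # Cyrillic small er
    "c": "\u0441",  # Cyrillic small es
}

def homoglyph_perturb(prompt: str) -> str:
    """Replace each mapped character with its Unicode lookalike."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in prompt)

original = "open source"
perturbed = homoglyph_perturb(original)
print(perturbed)               # renders like "open source" on screen
print(perturbed == original)   # but the underlying strings differ: False
```

Structural and textual encoding attacks follow the same principle with different transformations (e.g. inserting zero-width characters or re-encoding substrings) rather than character-for-character substitution.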