UMacauFeb 23, 2025arXiv:2502.16491

Intrinsic Model Weaknesses: How Priming Attacks Unveil Vulnerabilities in Large Language Models

Yuyi Huang, Runzhe Zhan, Derek F. Wong, Lidia S. Chao, Ailin Tao

AI Summary

The paper introduces novel priming attack strategies, inspired by psychological phenomena, to elicit harmful content generation from LLMs. These attacks, leveraging techniques like "Priming Effect", "Safe Attention Shift", and "Cognitive Dissonance", bypass the safety mechanisms of both open-source and closed-source models. The experiments demonstrate a 100% attack success rate on open-source models like Llama-3.2 and at least 95% on closed-source models like GPT-4o, highlighting significant vulnerabilities.

Key Contribution

LLM safety mechanisms are more vulnerable than we thought: psychological priming attacks achieve near-perfect success rates in eliciting harmful content across a wide range of models, including GPT-4o and Llama-3.2.

Abstract

Large language models (LLMs) have significantly influenced various industries but suffer from a critical flaw, the potential sensitivity of generating harmful content, which poses severe societal risks. We developed and tested novel attack strategies on popular LLMs to expose their vulnerabilities in generating inappropriate content. These strategies, inspired by psychological phenomena such as the"Priming Effect","Safe Attention Shift", and"Cognitive Dissonance", effectively attack the models' guarding mechanisms. Our experiments achieved an attack success rate (ASR) of 100% on various open-source models, including Meta's Llama-3.2, Google's Gemma-2, Mistral's Mistral-NeMo, Falcon's Falcon-mamba, Apple's DCLM, Microsoft's Phi3, and Qwen's Qwen2.5, among others. Similarly, for closed-source models such as OpenAI's GPT-4o, Google's Gemini-1.5, and Claude-3.5, we observed an ASR of at least 95% on the AdvBench dataset, which represents the current state-of-the-art. This study underscores the urgent need to reassess the use of generative models in critical applications to mitigate potential adverse societal impacts.

Constitutional AI & AI Ethics Natural Language Processing Red-Teaming & Adversarial Robustness

Citation Metrics

Citations1

Influential citations0

References44

Year2025

VenueNorth American Chapter of the Association for Computational Linguistics

Related Papers

Finding related papers...

Search

Intrinsic Model Weaknesses: How Priming Attacks Unveil Vulnerabilities in Large Language Models

Related Papers