Institute of Science Tokyo | NYCU | May 28, 2026 | arXiv:2605.29354

Harmless Yet Harmful: Neutral Prompting Attacks for Stealthy Hallucination Steering in Agent Skills

Chia-Yi Hsu, Chia-Mu Yu, Chun-Ying Huang, Jun Sakuma

AI Summary

The paper introduces Neutral Prompting Attacks (NPA), a novel method to subtly increase package hallucination in LLM-powered coding agents by using semantically benign instructions like encouraging imagination. NPA shifts the model's dependency generation towards more speculative package names, increasing both Hallucination Attack Success Rate (ASR) and Pip Install ASR. Experiments across multiple coding-oriented LLMs demonstrate that NPA effectively evades existing defenses, highlighting a significant software supply chain risk.

Key Contribution

Seemingly harmless prompts like "imagine all possibilities" can covertly steer LLMs to hallucinate software packages, creating a stealthy attack vector that bypasses existing defenses.

Abstract

LLM-powered coding agents increasingly participate in software development workflows by generating code, selecting dependencies, and producing package installation commands. This creates a new software supply chain risk: when an agent hallucinates a non-existent package, an attacker may register the hallucinated name and later compromise users who install it. Existing package hallucination attacks and defenses primarily focus on naturally occurring hallucinations, targeted dependency steering, or post-hoc package validation. In this paper, we introduce Neutral Prompting Attack (NPA), a highly stealthy attack paradigm in which semantically benign instructions, such as encouraging imagination and exhaustiveness, increase package hallucination propensity without containing explicit malicious intent. Unlike targeted dependency steering, NPA does not specify an attacker-chosen package. Instead, it shifts the model's dependency generation behavior toward more speculative package names. We evaluate NPA across multiple coding-oriented LLMs and package hallucination benchmarks. Our results show that NPA increases both Hallucination ASR and Pip Install ASR, changes the distribution of hallucinated package names, and evades existing static-analysis, LLM-based, and agent-based Skill defenses. These findings reveal that harmless-looking prompts can covertly manipulate hallucination behavior and create downstream software supply chain risks.
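The post-hoc package validation mentioned above can be illustrated with a minimal sketch: pull package names out of generated pip install commands and flag any that are absent from a set of known names. Everything named here is an assumption made for illustration and does not come from the paper: `KNOWN_PACKAGES` is a tiny hypothetical stand-in for a registry snapshot, the helper names are invented, and `unknown_rate` is a simple ratio of flagged commands, not the authors' definition of Hallucination ASR or Pip Install ASR.

```python
import re
import shlex

# Hypothetical stand-in for a registry snapshot (assumption, not from the paper).
# A real check would compare against an up-to-date package index.
KNOWN_PACKAGES = {"requests", "numpy", "pandas"}


def normalize(name: str) -> str:
    """Normalize a name as in PEP 503: lowercase, runs of '-', '_', '.' become '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def extract_packages(command: str) -> list[str]:
    """Return normalized package names from a 'pip install ...' command.

    Simplification: tokens starting with '-' are skipped, but values that
    follow flags such as '-r requirements.txt' are not handled.
    """
    tokens = shlex.split(command)
    if "install" not in tokens:
        return []
    names = []
    for arg in tokens[tokens.index("install") + 1:]:
        if arg.startswith("-"):
            continue
        # Drop version specifiers, extras, and environment markers.
        base = re.split(r"[<>=!~\[;]", arg, maxsplit=1)[0]
        if base:
            names.append(normalize(base))
    return names


def unknown_packages(command: str, known=KNOWN_PACKAGES) -> list[str]:
    """List packages in the command that are missing from the known set."""
    known_norm = {normalize(k) for k in known}
    return [p for p in extract_packages(command) if p not in known_norm]


def unknown_rate(commands: list[str], known=KNOWN_PACKAGES) -> float:
    """Fraction of commands that reference at least one unknown package."""
    if not commands:
        return 0.0
    flagged = sum(1 for c in commands if unknown_packages(c, known))
    return flagged / len(commands)
```

For example, `unknown_packages("pip install requests fastjsonx")` flags only the made-up name `fastjsonx`. A check of this kind sees only the generated command, not the instruction that produced it, so it would not by itself reveal that a benign-looking prompt had shifted the model toward speculative names.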

Code Generation & Program Synthesis | Red-Teaming & Adversarial Robustness | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
