OpenAIHangzhou High-Tech Zone (Binjiang)HuggingFaceMar 11, 2026arXiv:2603.10749

AttriGuard: Defeating Indirect Prompt Injection in LLM Agents via Causal Attribution of Tool Invocations

Yu He, Haozhe Zhu, Shuo Shao, H. Yao, Zhihao Liu, Zhan Qin

AI Summary

The paper introduces AttriGuard, a runtime defense mechanism against Indirect Prompt Injection (IPI) attacks in LLM agents, which leverages action-level causal attribution to distinguish between tool calls driven by user intent versus malicious external observations. AttriGuard employs parallel counterfactual tests, teacher-forced shadow replay, and hierarchical control attenuation to verify the necessity of each tool call by re-executing the agent under modified external observations. Experiments across four LLMs and two agent benchmarks demonstrate AttriGuard achieves 0% Attack Success Rate (ASR) under static attacks with negligible utility loss and maintains resilience under adaptive attacks where other defenses fail.

Key Contribution

By pinpointing the causal origins of tool use, AttriGuard neutralizes indirect prompt injection attacks that can hijack LLM agents, even when faced with adversarial optimization.

Abstract

LLM agents are highly vulnerable to Indirect Prompt Injection (IPI), where adversaries embed malicious directives in untrusted tool outputs to hijack execution. Most existing defenses treat IPI as an input-level semantic discrimination problem, which often fails to generalize to unseen payloads. We propose a new paradigm, action-level causal attribution, which secures agents by asking why a particular tool call is produced. The central goal is to distinguish tool calls supported by the user's intent from those causally driven by untrusted observations. We instantiate this paradigm with AttriGuard, a runtime defense based on parallel counterfactual tests. For each proposed tool call, AttriGuard verifies its necessity by re-executing the agent under a control-attenuated view of external observations. Technically, AttriGuard combines teacher-forced shadow replay to prevent attribution confounding, hierarchical control attenuation to suppress diverse control channels while preserving task-relevant information, and a fuzzy survival criterion that is robust to LLM stochasticity. Across four LLMs and two agent benchmarks, AttriGuard achieves 0% ASR under static attacks with negligible utility loss and moderate overhead. Importantly, it remains resilient under adaptive optimization-based attacks in settings where leading defenses degrade significantly.

Red-Teaming & Adversarial Robustness Tool Use & Agents

Citation Metrics

Citations0

Influential citations0

References67

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

AttriGuard: Defeating Indirect Prompt Injection in LLM Agents via Causal Attribution of Tool Invocations

Related Papers