Search papers, labs, and topics across Lattice.
The paper introduces AttriGuard, a runtime defense mechanism against Indirect Prompt Injection (IPI) attacks in LLM agents, which leverages action-level causal attribution to distinguish between tool calls driven by user intent versus malicious external observations. AttriGuard employs parallel counterfactual tests, teacher-forced shadow replay, and hierarchical control attenuation to verify the necessity of each tool call by re-executing the agent under modified external observations. Experiments across four LLMs and two agent benchmarks demonstrate AttriGuard achieves 0% Attack Success Rate (ASR) under static attacks with negligible utility loss and maintains resilience under adaptive attacks where other defenses fail.
By pinpointing the causal origins of tool use, AttriGuard neutralizes indirect prompt injection attacks that can hijack LLM agents, even when faced with adversarial optimization.
LLM agents are highly vulnerable to Indirect Prompt Injection (IPI), where adversaries embed malicious directives in untrusted tool outputs to hijack execution. Most existing defenses treat IPI as an input-level semantic discrimination problem, which often fails to generalize to unseen payloads. We propose a new paradigm, action-level causal attribution, which secures agents by asking why a particular tool call is produced. The central goal is to distinguish tool calls supported by the user's intent from those causally driven by untrusted observations. We instantiate this paradigm with AttriGuard, a runtime defense based on parallel counterfactual tests. For each proposed tool call, AttriGuard verifies its necessity by re-executing the agent under a control-attenuated view of external observations. Technically, AttriGuard combines teacher-forced shadow replay to prevent attribution confounding, hierarchical control attenuation to suppress diverse control channels while preserving task-relevant information, and a fuzzy survival criterion that is robust to LLM stochasticity. Across four LLMs and two agent benchmarks, AttriGuard achieves 0% ASR under static attacks with negligible utility loss and moderate overhead. Importantly, it remains resilient under adaptive optimization-based attacks in settings where leading defenses degrade significantly.