Mar 30, 2026arXiv:2603.28013

Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers

AI Summary

This paper introduces a stage-decomposed analysis of prompt injection attacks, tracking the propagation of cryptographic canary tokens through four kill-chain stages (Exposed, Persisted, Relayed, Executed) across different attack surfaces and defense conditions. The key finding is that model safety hinges on preventing the propagation of adversarial content across pipeline stages, rather than simply detecting its initial exposure. Experiments on five frontier LLM agents reveal vulnerabilities in memory and tool-stream surfaces, and highlight the limitations of current defense mechanisms due to threat-model surface mismatch.

Key Contribution

Model safety isn't about whether adversarial content is seen, but whether it spreads: Claude strips injections at write_memory, while GPT-4o-mini propagates them flawlessly.

Abstract

We present a stage-decomposed analysis of prompt injection attacks against five frontier LLM agents. Prior work measures task-level attack success rate (ASR); we localize the pipeline stage at which each model's defense activates. We instrument every run with a cryptographic canary token (SECRET-[A-F0-9]{8}) tracked through four kill-chain stages -- Exposed, Persisted, Relayed, Executed -- across four attack surfaces and five defense conditions (764 total runs, 428 no-defense attacked). Our central finding is that model safety is determined not by whether adversarial content is seen, but by whether it is propagated across pipeline stages. Concretely: (1) in our evaluation, exposure is 100% for all five models -- the safety gap is entirely downstream; (2) Claude strips injections at write_memory summarization (0/164 ASR), while GPT-4o-mini propagates canaries without loss (53% ASR, 95% CI: 41--65%); (3) DeepSeek exhibits 0% ASR on memory surfaces and 100% ASR on tool-stream surfaces from the same model -- a complete reversal across injection channels; (4) all four active defense conditions (write_filter, pi_detector, spotlighting, and their combination) produce 100% ASR due to threat-model surface mismatch; (5) a Claude relay node decontaminates downstream agents -- 0/40 canaries survived into shared memory.

Eval Frameworks & Benchmarks Red-Teaming & Adversarial Robustness Tool Use & Agents

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers

Related Papers