The paper introduces StepShield, a benchmark for evaluating *when* rogue behaviors in code-generating AI agents are detected, moving beyond binary accuracy. StepShield comprises 9,213 code agent trajectories with annotated rogue behaviors grounded in real-world security incidents, and proposes three temporal metrics: Early Intervention Rate (EIR), Intervention Gap, and Tokens Saved. Evaluations on StepShield reveal large performance differences between detectors (e.g., LLM-based judges vs. static analyzers) that standard accuracy metrics mask, and demonstrate the economic benefits of early detection via a cascaded detector.
LLM-based judges can detect rogue agent behaviors 2.3x earlier than static analyzers, a difference completely masked by standard accuracy metrics.
Existing agent safety benchmarks report binary accuracy, conflating early intervention with post-mortem analysis. A detector that flags a violation at step 8 enables intervention; one that reports it at step 48 provides only forensic value. This distinction is critical, yet current benchmarks cannot measure it. We introduce StepShield, the first benchmark to evaluate *when* violations are detected, not just *whether*. StepShield contains 9,213 code agent trajectories, including 1,278 meticulously annotated training pairs and a 7,935-trajectory test set with a realistic 8.1% rogue rate. Rogue behaviors are grounded in real-world security incidents across six categories. We propose three novel temporal metrics: Early Intervention Rate (EIR), Intervention Gap, and Tokens Saved. Surprisingly, our evaluation reveals that an LLM-based judge achieves 59% EIR while a static analyzer achieves only 26%, a 2.3x performance gap that is entirely invisible to standard accuracy metrics. We further show that early detection has direct economic benefits: our cascaded HybridGuard detector reduces monitoring costs by 75% and projects to $108M in cumulative savings over five years at enterprise scale. By shifting the focus of evaluation from *whether* to *when*, StepShield provides a new foundation for building safer and more economically viable AI agents. The code and data are released under an Apache 2.0 license.
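The abstract names the three temporal metrics but does not spell out their formulas. A minimal sketch of one plausible reading follows; the `Trajectory` fields, the early-detection `window`, and the exact definitions are assumptions for illustration, not the paper's specification.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Trajectory:
    violation_step: int             # step at which the rogue behavior occurs
    detection_step: Optional[int]   # step at which the detector flags it (None = missed)
    tokens_per_step: list           # tokens consumed at each step of the run

def intervention_gap(t: Trajectory) -> Optional[int]:
    """Steps between the violation and its detection (lower is better)."""
    if t.detection_step is None:
        return None
    return t.detection_step - t.violation_step

def early_intervention_rate(trajs: list, window: int = 2) -> float:
    """Fraction of rogue trajectories flagged within `window` steps of the
    violation. The window size is a hypothetical parameter, not from the paper."""
    gaps = [intervention_gap(t) for t in trajs]
    return sum(1 for g in gaps if g is not None and g <= window) / len(trajs)

def tokens_saved(t: Trajectory) -> int:
    """Tokens not spent because the run halts once the detector fires."""
    if t.detection_step is None:
        return 0
    return sum(t.tokens_per_step[t.detection_step + 1:])
```

Under this reading, a detector that fires one step after the violation scores well on all three metrics, while a post-hoc analyzer that flags the finished trajectory contributes nothing to Tokens Saved, which is what makes the 2.3x EIR gap visible even when binary accuracy is identical.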