May 6, 2026 · arXiv:2605.04785

AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use

AI Summary

AgentTrust introduces a runtime safety layer that intercepts AI agent tool calls to prevent unsafe real-world actions such as data exfiltration or accidental deletion. It combines shell deobfuscation, SafeFix suggestions, RiskChain detection for multi-step attacks, and a cache-aware LLM-as-Judge to return structured safety verdicts. On the internal 300-scenario benchmark, it reaches 95.0% verdict accuracy and 73.7% risk-level accuracy at low-millisecond latency. On a separate 630-scenario adversarial benchmark, evaluated under a patched ruleset rather than zero-shot, it reaches 96.7% verdict accuracy, including about 93% on shell-obfuscated payloads.

Key Contribution

Rather than measuring agent behavior after the fact, AgentTrust intercepts tool calls *before* execution, giving it a chance to block a risky action, warn about it, or suggest a safer alternative in real time.

Abstract

Modern AI agents execute real-world side effects through tool calls such as file operations, shell commands, HTTP requests, and database queries. A single unsafe action, including accidental deletion, credential exposure, or data exfiltration, can cause irreversible harm. Existing defenses are incomplete: post-hoc benchmarks measure behavior after execution, static guardrails miss obfuscation and multi-step context, and infrastructure sandboxes constrain where code runs without understanding what an action means. We present AgentTrust, a runtime safety layer that intercepts agent tool calls before execution and returns a structured verdict: allow, warn, block, or review. AgentTrust combines a shell deobfuscation normalizer, SafeFix suggestions for safer alternatives, RiskChain detection for multi-step attack chains, and a cache-aware LLM-as-Judge for ambiguous inputs. We release a 300-scenario benchmark across six risk categories and an additional 630 independently constructed real-world adversarial scenarios. On the internal benchmark, the production-only ruleset achieves 95.0% verdict accuracy and 73.7% risk-level accuracy at low-millisecond end-to-end latency. On the 630-scenario benchmark, evaluated under a patched ruleset and not claimed as zero-shot, AgentTrust achieves 96.7% verdict accuracy, including about 93% on shell-obfuscated payloads. AgentTrust is released under the AGPL-3.0 license and provides a Model Context Protocol server for MCP-compatible agents.
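The interception flow described in the abstract can be illustrated with a minimal sketch. This is not AgentTrust's actual API: `ToolCall`, `normalize_shell`, `evaluate`, the rule patterns, and the set of known tools are all hypothetical names and choices made for illustration. It only shows the general shape of the idea, which is to normalize an obfuscated shell command, match it against rules, and return one of the four verdicts named in the abstract. Here the review verdict is assumed to be a fallback for tool types the rules do not cover, standing in for the cases the paper routes to an LLM-as-Judge.

```python
import base64
import re
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    # The four verdicts named in the abstract.
    ALLOW = "allow"
    WARN = "warn"
    BLOCK = "block"
    REVIEW = "review"


@dataclass(frozen=True)
class ToolCall:
    tool: str      # hypothetical tool kind, e.g. "shell", "http", "file"
    argument: str  # raw command, URL, or path


# Assumption: only one obfuscation pattern is handled, `echo <b64> | base64 -d`.
_B64_PIPE = re.compile(
    r"echo\s+['\"]?([A-Za-z0-9+/=]+)['\"]?\s*\|\s*base64\s+(?:-d|--decode)"
)

# Assumption: a tiny illustrative ruleset, checked in order (blocks before warnings).
_RULES = [
    (re.compile(r"\brm\s+-\w*r\w*\s+/(?:\s|$)"), Verdict.BLOCK),
    (re.compile(r"\bcurl\b.*(?:\.env|id_rsa|\.aws/credentials)"), Verdict.BLOCK),
    (re.compile(r"\bsudo\b"), Verdict.WARN),
]

_KNOWN_TOOLS = {"shell", "http", "file"}


def normalize_shell(command: str) -> str:
    """Append the decoded payload of an echo|base64 pipe so rules can inspect it."""
    match = _B64_PIPE.search(command)
    if not match:
        return command
    try:
        decoded = base64.b64decode(match.group(1), validate=True).decode("utf-8")
    except ValueError:  # covers invalid base64 and invalid UTF-8
        return command
    return f"{command}\n{decoded}"


def evaluate(call: ToolCall) -> Verdict:
    """Return a verdict for a tool call before it is executed."""
    if call.tool not in _KNOWN_TOOLS:
        return Verdict.REVIEW  # rules cannot judge this tool type
    text = normalize_shell(call.argument) if call.tool == "shell" else call.argument
    for pattern, verdict in _RULES:
        if pattern.search(text):
            return verdict
    return Verdict.ALLOW
```

In this sketch an agent runtime would call `evaluate` on each pending `ToolCall` and execute it only if the verdict permits. Without the normalization step, the encoded form of a destructive command would pass the rules unnoticed, which is the gap in static guardrails that the abstract points to.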

Constitutional AI & AI Ethics · Red-Teaming & Adversarial Robustness · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
