DEFENSE LOW NEW

AgentTrust: vetting agent tool calls before they execute

A preprint from May 6, 2026 introduces AgentTrust, a runtime layer that vets each agent tool call before it runs and returns allow/warn/block/review — catching obfuscated shell payloads static guards miss.

2026-06-08 // 6 min

What is this?

On May 6, 2026, a preprint titled AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use (arXiv:2605.04785) proposed a defense for a problem the rest of the month kept illustrating: AI agents now take actions with real-world side effects (file operations, shell commands, HTTP requests, database queries), and a single unsafe action, such as an accidental deletion, a leaked credential, or an exfiltrated file, can cause irreversible harm. AgentTrust sits between the agent and its tools and decides, before each call executes, whether to let it through.

The work was flagged in Adversa AI’s June 2026 agentic-security roundup as one of the month’s notable agent defenses. Its motivation is concrete: in the same window, Microsoft documented how a prompt injection can reach host-level remote code execution through an agent’s model-invokable functions. If injections can turn prompts into shells, the last line of defense is whatever inspects the action the agent is about to take.

How it works

AgentTrust is a runtime interception layer. Every tool call the agent attempts is paused, evaluated, and assigned one of four structured verdicts — allow, warn, block, or review — before it is allowed to run. The paper argues this fills a gap left by the three defenses people usually reach for, each of which the authors describe as incomplete on its own:

Existing control        What it does                      Where it falls short
----------------------  --------------------------------  ------------------------------------
Post-hoc benchmarks     Measure agent behavior            Judge after the action already ran
Static guardrails       Pattern-match inputs/outputs      Miss obfuscation and multi-step context
Infra sandboxes         Constrain WHERE code runs         Don't understand WHAT an action means
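
The interception pattern itself (pause the call, evaluate it, then allow, warn, block, or review) can be sketched in a few lines of Python. This is a minimal illustration, not AgentTrust's code: the names ToolCall, toy_policy, mediate, and BlockedCall are hypothetical, and the rules inside toy_policy are invented placeholders rather than the paper's ruleset.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict


class Verdict(Enum):
    ALLOW = "allow"
    WARN = "warn"
    BLOCK = "block"
    REVIEW = "review"


@dataclass
class ToolCall:
    tool: str             # e.g. "shell", "http", "file_read" (assumed names)
    args: Dict[str, Any]


class BlockedCall(Exception):
    """Raised when a call is vetoed or held for human review."""


def toy_policy(call: ToolCall) -> Verdict:
    # Placeholder rules for illustration only. They match raw text, which is
    # the weakness of static guardrails; a real policy would be far richer.
    command = str(call.args.get("command", ""))
    if call.tool == "shell" and "rm -rf" in command:
        return Verdict.BLOCK
    if call.tool == "http" and call.args.get("method") == "POST":
        return Verdict.REVIEW
    if call.tool == "shell" and command.startswith("sudo "):
        return Verdict.WARN
    return Verdict.ALLOW


def mediate(
    call: ToolCall,
    execute: Callable[[ToolCall], Any],
    policy: Callable[[ToolCall], Verdict] = toy_policy,
    log: Callable[[str], None] = print,
) -> Any:
    """Evaluate a tool call first; run it only if the verdict permits."""
    verdict = policy(call)
    if verdict is Verdict.BLOCK:
        raise BlockedCall(f"blocked: {call.tool}")
    if verdict is Verdict.REVIEW:
        raise BlockedCall(f"held for human review: {call.tool}")
    if verdict is Verdict.WARN:
        log(f"warning: risky call to {call.tool}")
    return execute(call)
```

The key design point is that execute is never reached for a blocked or review-held call, so the decision is made before any side effect occurs.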

To close that gap, AgentTrust combines four components. A shell deobfuscation normalizer unwinds the tricks attackers use to hide a dangerous command from a naive pattern match — variable expansion, hex/octal escapes, alias resolution, command substitution, ANSI-C quoting, adjacent-quote concatenation — so the verdict is rendered against what the command actually does, not how it was spelled. SafeFix is a rule-driven engine that, rather than only blocking, proposes a safer alternative to a risky call. RiskChain looks across steps to catch multi-step attack chains that look benign call-by-call. And a cache-aware LLM-as-Judge handles the ambiguous inputs that rules cannot settle, with caching to keep latency low.
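
To make the normalization idea concrete, here is a minimal Python sketch covering three of the tricks listed above: ANSI-C quoting with hex/octal escapes, adjacent-quote concatenation, and simple variable expansion. It is not the paper's normalizer. The function name normalize and its helpers are hypothetical, and alias resolution, command substitution, and many shell edge cases are deliberately left out.

```python
import re
import shlex
from typing import Dict, List

_ANSI_C = re.compile(r"\$'((?:[^'\\]|\\.)*)'")            # $'...' spans
_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{1,2})|\\([0-7]{1,3})")
_ASSIGN = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)=(.*)", re.DOTALL)
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}|\$([A-Za-z_][A-Za-z0-9_]*)")


def _decode_ansi_c(match) -> str:
    """Decode \\xHH and \\NNN escapes inside a $'...' span."""
    def decode(esc):
        hex_digits, octal_digits = esc.groups()
        if hex_digits is not None:
            return chr(int(hex_digits, 16))
        return chr(int(octal_digits, 8))
    return shlex.quote(_ESCAPE.sub(decode, match.group(1)))


def _split_statements(command: str) -> List[str]:
    """Split on ';' characters that sit outside quotes."""
    parts, buf, quote, i = [], [], None, 0
    while i < len(command):
        ch = command[i]
        if ch == "\\" and quote != "'" and i + 1 < len(command):
            buf.append(command[i:i + 2])   # keep the escaped pair together
            i += 2
            continue
        if quote:
            if ch == quote:
                quote = None
        elif ch in "'\"":
            quote = ch
        elif ch == ";":
            parts.append("".join(buf))
            buf = []
            i += 1
            continue
        buf.append(ch)
        i += 1
    parts.append("".join(buf))
    return [p for p in parts if p.strip()]


def normalize(command: str) -> str:
    """Return a canonical spelling of a simple shell command line.

    Raises ValueError on unbalanced quotes; a caller should treat that
    as a reason not to auto-allow.
    """
    command = _ANSI_C.sub(_decode_ansi_c, command)
    env: Dict[str, str] = {}
    out: List[str] = []
    for stmt in _split_statements(command):
        # shlex.split removes quotes, which joins pieces like r''m into rm.
        tokens = [
            _VAR.sub(lambda m: env.get(m.group(1) or m.group(2), m.group(0)), t)
            for t in shlex.split(stmt)
        ]
        assignment = _ASSIGN.fullmatch(tokens[0]) if len(tokens) == 1 else None
        if assignment:
            env[assignment.group(1)] = assignment.group(2)  # remember NAME=value
            continue
        out.append(shlex.join(tokens))
    return "; ".join(out)
```

Under these assumptions, the inputs $'\x72\x6d' -rf /, r''m -rf /, and a=r;b=m;$a$b -rf / all normalize to rm -rf /, so a rule written against the canonical form sees the same command regardless of spelling. Variables the sketch has not seen assigned are left untouched rather than guessed.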

On the paper’s internal 300-scenario benchmark (six risk categories), the production-only ruleset reports 95.0% verdict accuracy and 73.7% risk-level accuracy at low-millisecond latency. On a separate set of 630 real-world adversarial scenarios — evaluated under a patched ruleset, and explicitly not claimed as zero-shot — it reports 96.7% verdict accuracy, including roughly 93% on shell-obfuscated payloads. Those numbers are the authors’ own; as with any single-paper evaluation, treat them as a starting point, not an independent guarantee.
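
For readers who want to score a guard the same general way on their own labeled scenarios, a small Python sketch follows. It assumes that verdict accuracy means the fraction of scenarios where the predicted verdict equals the labeled one; the paper's exact metric definitions are not reproduced here, and the function name score and the false_block_rate metric are hypothetical additions for illustration.

```python
from typing import Dict, Iterable, Tuple


def score(records: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Score (expected, predicted) verdict pairs.

    verdict_accuracy: share of scenarios where predicted == expected
                      (an assumed definition, see the lead-in).
    false_block_rate: share of scenarios labeled "allow" that the guard
                      blocked, i.e. legitimate work it interrupted.
    """
    pairs = list(records)
    if not pairs:
        raise ValueError("no records to score")
    correct = sum(expected == predicted for expected, predicted in pairs)
    benign = [predicted for expected, predicted in pairs if expected == "allow"]
    false_blocks = sum(predicted == "block" for predicted in benign)
    return {
        "verdict_accuracy": correct / len(pairs),
        "false_block_rate": false_blocks / len(benign) if benign else 0.0,
    }
```

Tracking the false-block rate separately matters because a high overall accuracy can still hide a guard that interrupts a noticeable share of legitimate calls.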

Why it matters

The agent threat model has shifted from “what does the model say” to “what does the model do.” The disclosures stacking up through 2026 — coding-agent RCEs, prompt-injection-to-shell chains, poisoned tool and memory records — share a cause: an agent was trusted to take an action whose real effect nobody inspected. A layer that understands the meaning of a tool call, deobfuscates it, and can veto it is a direct answer to that class of failure.

It also matters that AgentTrust ships as a Model Context Protocol server under AGPL-3.0. That lets teams place it in front of MCP-compatible agents without rebuilding them, and it makes the deobfuscation rules auditable rather than a black box. The trade-off is the familiar one for any inline guard: every blocked legitimate action is friction, and a confidently wrong “allow” is worse than no guard at all. Verdict quality and the false-positive rate therefore determine whether teams keep it switched on.

Defenses

AgentTrust is itself a defensive control. The practical takeaways for teams running tool-using agents:

Mediate tool calls, don’t just sandbox them. A sandbox limits where code runs; an action-level mediator decides whether a specific call should run at all. Use both — they cover different failure modes.
Normalize before you judge. Any allow/deny decision made on raw command text is one obfuscation trick away from being wrong. Deobfuscate shell input (variable expansion, hex/octal escapes, aliases, command substitution, quoting tricks) and evaluate the canonical form.
Reason across steps, not just per call. Multi-step chains can be individually innocuous and collectively an exfiltration. Keep enough context to catch the chain, not only the single call.
Prefer safer alternatives to hard blocks. A guard that only blocks gets disabled the first time it interrupts real work. Offering a safer rewrite (the SafeFix idea) preserves utility and keeps the guard switched on.
Keep a human in the loop for the “review” tier. Reserve that verdict for actions too consequential to auto-allow and too plausible to auto-block, such as irreversible deletes, credential access, and outbound transfers, and route them to a person.
Measure your own false positives. Vendor or paper accuracy numbers are a starting point. Before you trust an inline guard in production, test it on your own traffic and watch what it wrongly blocks, because that is what decides whether it survives contact with your users.
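
The "reason across steps" takeaway can be illustrated with a small Python sketch that remembers whether a session has touched sensitive data and escalates a later outbound request. This is a toy stand-in for the idea, not RiskChain itself: the class ChainTracker, the Step fields, the tool names, and the SENSITIVE_MARKERS list are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical path fragments treated as sensitive for this illustration.
SENSITIVE_MARKERS = (".ssh/", ".aws/credentials", ".env")


@dataclass
class Step:
    tool: str     # assumed tool names: "file_read", "shell", "http"
    target: str   # path, command, or URL


@dataclass
class ChainTracker:
    """Keeps per-session context so a verdict can depend on earlier steps."""
    tainted: bool = False
    history: List[Step] = field(default_factory=list)

    def observe(self, step: Step) -> str:
        self.history.append(step)
        if step.tool == "file_read" and any(m in step.target for m in SENSITIVE_MARKERS):
            # Reading a secret is plausible on its own; remember it happened.
            self.tainted = True
            return "allow"
        if step.tool == "http" and self.tainted:
            # An outbound request after a sensitive read looks like exfiltration.
            return "review"
        return "allow"
```

Evaluated one call at a time, both the file read and the HTTP request would pass; only the retained session state lets the second call be routed to review.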

Status

Item               Reference                Date        Notes
-----------------  -----------------------  ----------  ------------------------------------
Preprint           arXiv:2605.04785         2026-05-06  Runtime interception; verdicts allow/warn/block/review; AGPL-3.0, ships an MCP server
Roundup mention    Adversa AI               2026-06-01  Listed under “Agentic AI defense”
Threat motivation  Microsoft Security Blog  2026-05-07  Prompt injection reaching host-level RCE via model-invokable functions

The headline is not “tool-call interception solves agent security.” It is narrower: once an agent can act, the action — not the prompt — is the boundary worth defending, and that boundary has to understand what a call means, not just how it is written. AgentTrust is one published, open-source attempt to make that boundary real; the reported numbers are the authors’ own, so validate it on your own traffic before relying on it.