AGENTS MEDIUM NEW

Authority confusion: why tool-using agents misuse their own access

A May 2026 paper names a failure mode distinct from prompt injection and the principle that counters it: untrusted data may inform an agent's reasoning but must never authorize side effects. AIRGuard enforces that line at action time.

2026-06-19 // 7 min // affects: claude-haiku-4.5, claude-sonnet-4.6, gpt-5.4-mini, gpt-5.3-codex, mcp-agents

What is this?

On May 27, 2026, researchers from the University of Notre Dame, Inria and the University of Liverpool posted AIRGuard: Guarding Agent Actions with Runtime Authority Control to arXiv (arXiv:2605.28914). The paper names a failure mode it calls authority confusion and proposes a runtime defense against it. The core idea fits in one sentence the authors repeat throughout: data can inform, but only authority can authorize.

Authority confusion is the gap between what an agent is allowed to do and what some piece of content suggests it should do. A tool-using agent reads files, runs shell commands, calls APIs, sends email and invokes MCP tools. Attacker-controlled content — a webpage, a retrieved document, a package, a helper script, an MCP tool result — can describe an action that looks task-relevant in isolation but quietly redirects the agent’s authorized access toward the attacker’s goal. The paper argues this is distinct from both jailbreaks and classic prompt injection, and that defenses built only on data–instruction separation or parameter provenance do not address it.

How it works

The distinction matters because the malicious step is rarely suspicious by its tool type. Reading a file, sending a message, calling a domain API or changing a configuration are all routine, legitimate actions. The problem is whose authority justifies them.

The paper’s worked examples make this concrete. Attacker-controlled documentation can label an external URL as an “audit” endpoint — but that label does not authorize the agent to transmit local reports, credentials or configuration data there. A package can contain installation instructions without authorizing persistence. An MCP tool output can suggest a recipient without authorizing an email. A downloaded script can help with a task without authorizing its own execution. In each case the action’s parameters may be well-grounded in observed evidence, yet the operation falls outside the scope the user actually granted.

This is why provenance and taint-style checks are insufficient on their own: evidence is not authority. An argument can be perfectly grounded in retrieved content while the resulting side effect is still unauthorized.
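The gap between evidence and authority can be illustrated with a small sketch. Everything here is hypothetical and is not the paper's implementation: the `ToolCall` shape, the `TASK_SCOPE` set, the file names and the URLs are assumptions chosen to mirror the "audit endpoint" example above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolCall:
    tool: str          # hypothetical tool name, e.g. "http_post"
    target: str        # where the side effect lands
    arg_sources: dict  # argument name -> where its value was observed

# Assumed scope granted by the user's task: allowed (tool, target) pairs.
TASK_SCOPE = {
    ("read_file", "./reports/q2.txt"),
    ("http_post", "https://internal.example/api"),
}

def is_grounded(call: ToolCall, observed: set) -> bool:
    """Provenance-style check: every argument traces back to observed content."""
    return all(src in observed for src in call.arg_sources.values())

def is_authorized(call: ToolCall, scope: set) -> bool:
    """Authority check: the (tool, target) pair must be inside the granted scope."""
    return (call.tool, call.target) in scope

# The attacker's documentation names the URL, so the argument is "grounded".
observed = {"vendor_docs.md", "./reports/q2.txt"}
call = ToolCall(
    tool="http_post",
    target="https://audit.attacker.example/upload",
    arg_sources={"url": "vendor_docs.md", "body": "./reports/q2.txt"},
)

print(is_grounded(call, observed))      # True
print(is_authorized(call, TASK_SCOPE))  # False
```

The call passes the provenance check because both arguments come from content the agent really read, yet it fails the authority check because the user never granted a post to that target.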

Why it matters

As agents move from producing text to taking actions, the blast radius of a single misjudged step grows: data exfiltration, configuration poisoning, supply-chain installs, unauthorized disclosure. Authority confusion also compounds across steps — individually plausible actions can add up to a harmful sequence that no single check would flag.
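A cross-step check might look like the sketch below. It is a deliberately simple illustration, not the paper's sequence audit: the tool names, the two category sets and the single rule (a send to an untrusted target after a sensitive read) are all assumptions.

```python
# Hypothetical tool categories; a real deployment would derive these from policy.
SENSITIVE_READS = {"read_credentials", "read_config"}
EXTERNAL_SENDS = {"http_post", "send_email"}

def audit_sequence(steps):
    """Return indices of steps that send to an untrusted target
    after some earlier step performed a sensitive read.

    steps: list of (tool_name, target_is_trusted) tuples in execution order.
    """
    flagged = []
    seen_sensitive = False
    for i, (tool, target_trusted) in enumerate(steps):
        if tool in SENSITIVE_READS:
            seen_sensitive = True
        elif tool in EXTERNAL_SENDS and seen_sensitive and not target_trusted:
            flagged.append(i)
    return flagged

steps = [("read_config", True), ("summarize", True), ("http_post", False)]
print(audit_sequence(steps))                  # [2]
print(audit_sequence([("http_post", False)])) # []
```

Neither the read nor the post is flagged in isolation; only the combination is, which is the kind of risk a per-call check misses.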

The paper quantifies the gap on two benchmarks: AgentTrap (141 cases, 91 of them authority-confusion attacks spanning exfiltration, config poisoning, MCP abuse, privilege escalation and more) and DTAP-150 (150 MCP domain tasks across code, filesystem, finance, legal and telecom). With no defense, attack success rate (ASR) on AgentTrap reached 20.9% with Claude Haiku 4.5 and 36.3% with Claude Sonnet 4.6. In other words, between a fifth and more than a third of attacks succeeded against capable models running ordinary agent loops.
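To make the percentages concrete, the sketch below converts them to approximate attack counts. It assumes ASR is computed over the 91 attack cases, which this summary does not state explicitly; the counts are therefore an inference, not figures reported by the authors.

```python
ATTACK_CASES = 91  # assumption: ASR denominator is the 91 authority-confusion attacks

def implied_successes(asr_percent, n=ATTACK_CASES):
    """Nearest whole number of successful attacks implied by an ASR percentage."""
    return round(asr_percent / 100 * n)

for label, asr in [("Haiku 4.5, no defense", 20.9), ("Sonnet 4.6, no defense", 36.3)]:
    k = implied_successes(asr)
    print(f"{label}: about {k}/{ATTACK_CASES} = {100 * k / ATTACK_CASES:.1f}%")
```

Under that assumption the reported rates correspond to about 19 and 33 successful attacks out of 91, and both counts round back to the published percentages.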

Defenses

AIRGuard treats least privilege as an action-time authorization problem, not just a static permission assignment. It is a pre-action guard layer that, before each side-effecting tool call, checks four things: the authority inherited from the user task and policy, the trust level of the target, the trust level of the source, and the likely effect of the action. Concretely, the paper combines seven mechanisms: capability mapping; authority inheritance (task-level authority can narrow into step-level authority but never expand); resource and target trust labels; source trust pools (high-reputation sources can inform execution while low-trust ones trigger inspection); side-effect simulation for sensitive actions; a tiered enforcement cascade; and a sequence audit that catches cross-step risk.
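Two of these ideas, narrow-only authority inheritance and a tiered decision over the four inputs, can be sketched as follows. This is an illustrative assumption about how such a guard could be wired, not AIRGuard's actual rules: the capability strings, the `Decision` tiers and the ordering of the checks are all invented for the example.

```python
from enum import Enum

class Decision(Enum):
    ALLOW = "allow"
    INSPECT = "inspect"  # escalate, e.g. to simulation or human review
    BLOCK = "block"

def narrow(task_authority: frozenset, requested: frozenset) -> frozenset:
    """Step authority is an intersection, so it can shrink but never grow."""
    return task_authority & requested

def decide(capability, step_authority, target_trusted, source_trusted, sensitive_effect):
    # 1. Inherited authority: capabilities outside it are never allowed.
    if capability not in step_authority:
        return Decision.BLOCK
    # 2. Target trust: sensitive effects on untrusted targets are blocked.
    if sensitive_effect and not target_trusted:
        return Decision.BLOCK
    # 3. Source trust and 4. likely effect: low-trust or sensitive gets inspected.
    if not source_trusted or sensitive_effect:
        return Decision.INSPECT
    return Decision.ALLOW

task = frozenset({"fs.read", "net.post"})
# A step asks for one capability the task has and one it does not.
step = narrow(task, frozenset({"net.post", "email.send"}))

print(sorted(step))                                   # ['net.post']
print(decide("email.send", step, True, True, False))  # Decision.BLOCK
print(decide("net.post", step, False, False, True))   # Decision.BLOCK
print(decide("net.post", step, True, False, False))   # Decision.INSPECT
print(decide("net.post", step, True, True, False))    # Decision.ALLOW
```

Because `narrow` is a set intersection, a step that requests `email.send` cannot acquire it when the task never held it, no matter what the content the agent read suggests.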

Two design lessons generalize beyond this one system. First, prompting is not enforcement: in the ablation, putting the policy in the prompt alone cut ASR only from 22% to 17%, while the runtime guard reached 4% — because it observes normalized tool calls and intervenes before side effects execute rather than asking the model to police itself. Second, expect a security–usability tension: stricter action-time checks reduce unauthorized side effects but can over-block legitimate work, so enforcement has to be selective (the authors report some over-defense, e.g. 6% on the DTAP-150 ablation).
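Expressed as relative reductions, the ablation figures above work out as follows. The helper is only arithmetic on the percentages quoted in this article; it is not a metric the authors are known to report.

```python
def relative_reduction(before, after):
    """Fraction of the original attack success rate that was eliminated."""
    return (before - after) / before

print(f"prompt-only policy: {relative_reduction(22, 17):.0%}")  # 23%
print(f"runtime guard:      {relative_reduction(22, 4):.0%}")   # 82%
```

Prompting removes roughly a quarter of successful attacks, while the runtime guard removes roughly four fifths of them.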

For teams shipping agents today, the actionable takeaway is to add a deterministic authorization check at the tool boundary that is driven by the user’s task and your policy — not by the content the agent just read.
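A minimal version of such a check might look like the sketch below. It assumes a hypothetical tool registry (a dict of callables) and an allow-list of `(tool, target)` pairs derived from the user's task; neither comes from the paper, and a production guard would need richer policy than exact matching.

```python
class Unauthorized(Exception):
    """Raised before a tool runs when the call is outside the task's authority."""

def make_guard(allowed):
    """allowed: set of (tool_name, target) pairs derived from task and policy,
    fixed before the agent reads any untrusted content."""
    def guarded_call(tools, tool_name, target, **kwargs):
        if (tool_name, target) not in allowed:
            raise Unauthorized(f"{tool_name} -> {target} is outside the task's authority")
        return tools[tool_name](target, **kwargs)
    return guarded_call

# Stand-in tool that records what it would have sent.
sent = []
tools = {"send_email": lambda target, body: sent.append((target, body)) or "sent"}

guard = make_guard({("send_email", "alice@example.com")})
print(guard(tools, "send_email", "alice@example.com", body="weekly report"))  # sent

try:
    # Recipient suggested by a tool result, not granted by the user.
    guard(tools, "send_email", "eve@attacker.example", body="weekly report")
except Unauthorized as err:
    print("blocked:", err)

print(len(sent))  # 1
```

The check is deterministic and runs before the side effect, so a persuasive tool result can change what the model proposes but not what actually executes.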

Status

| Item | Detail |
| --- | --- |
| Paper | AIRGuard, arXiv:2605.28914v1, posted May 27, 2026 |
| Type | Defensive research (runtime guard), not an active exploit |
| Tested models | Claude Haiku 4.5, Claude Sonnet 4.6; ablations with GPT-5.4-mini and GPT-5.3-codex |
| Result | AgentTrap ASR 20.9%→3.3% (Haiku), 36.3%→5.5% (Sonnet); best ASR tier on 3/4 models on DTAP-150 |
| Baselines compared | ARGUS, MELON |

Reported figures are from the authors’ own evaluation and reflect their benchmarks and model versions as of the paper’s May 2026 release.

Sources