DEFENSE LOW NEW

AuthGraph: dual-graph alignment to catch agent prompt injection

A May 26, 2026 UCLA paper compares a clean authorization graph against the agent's actual provenance graph, cutting AgentDojo attack success from 40% to 1%.

2026-06-19 // 6 min affects: llm-agents, tool-using-agents, mcp-clients

What is this?

AuthGraph is a defensive framework for tool-using LLM agents, described in an arXiv preprint (2605.26497, cs.CR) posted on May 26, 2026 by Peiran Wang and colleagues at UCLA. It targets indirect prompt injection: the attack where an agent reads an external data source it does not control — an email, a webpage, a file — and that source carries hidden instructions that steer the agent into an unauthorized action, such as transferring funds to an attacker-controlled account.

The paper’s framing is that existing defenses share a structural blind spot. Tool-call value checkers inspect a call’s arguments without tracking where those values came from. Trace-graph analyzers build one graph of the execution and inspect it after the fact — but if the injection already manipulated the agent while that graph was being built, the graph faithfully records the manipulated view, with nothing to compare it against. AuthGraph’s contribution is to build a second, independent graph that the injection cannot reach, and to detect the attack by comparing the two.

How it works

AuthGraph constructs two complementary graphs over a single agent task. The first is the injected reasoning graph (IRG): information provenance reconstructed from the agent’s actual execution trajectory, deliberately exposed to whatever the agent read, including any injected content. It records the agent’s “subjective view” of where each value came from — manipulation and all.

The second is the authorization graph. It is derived from the user’s original intent in an isolated, clean context that, by construction, never sees the untrusted data. The authors describe this baseline as information-theoretically impossible to influence through injection: the planner that builds it simply is not shown the attacker-controlled bytes. This graph is parameter-source-level (it constrains not just which tools may run but where each argument is allowed to originate), least-privilege, and extensible at runtime.

A graph alignment checker then structurally compares the two. Because the authorization graph is an unforgeable reference for “what the agent should do” and the IRG captures “what the agent actually did,” a mismatch exposes the injection — at both the tool level (an action that was never authorized) and the parameter-source level (an authorized action whose argument was silently sourced from poisoned data). Crucially, the final verdict rests on the raw trajectory evidence, not on an LLM reasoning over text that may itself be poisoned.

The running example is a fraudulent book_flight(flight_id="EVIL-123") call: a per-call value check or a single-graph trace cannot tell that the flight_id was injected, but a structural comparison against a clean authorization baseline can.

Why it matters

This is the confused-deputy problem at the heart of agent security: the agent is authorized to act, but the data it consulted has been corrupted, so it faithfully executes a plan with attacker-chosen parameters. It is the same lethal trifecta — private data, untrusted content, and an external action channel in one task — that Simon Willison has documented at length.

The reported numbers are the reason to pay attention. On the AgentDojo benchmark, AuthGraph reduces attack success rate from 40% to 1% while keeping a 76% task-completion rate on GPT-4o; on AgentDyn, it drops attack success from 39% to 2% while preserving 51% utility. The authors report this outperforms recent plan-then-check and information-flow defenses including CaMeL, DRIFT, and Progent. The practical surface is any agent that reads attacker-reachable content and can then take consequential actions — payments, email, deployments, file writes.

Defenses

The takeaway for builders is architectural, and it generalizes beyond this specific implementation. Derive an authorization specification from user intent before the agent touches untrusted data, and keep that specification in a context the untrusted data can never enter — an injection-free baseline is only trustworthy if it is structurally isolated, not merely prompted to ignore instructions. Track provenance at the parameter-source level, not just per tool call, so a value derived from poisoned input cannot quietly become an argument to a sensitive action. Base the final allow/deny decision on trajectory evidence rather than on a model summarizing text that may already be compromised. These ideas extend the lineage-and-least-privilege direction of related work such as provenance-graph defenses and the design patterns for securing LLM agents (Beurer-Kellner et al., June 2025), which argue that prompt injection must be contained architecturally rather than solved at the model layer.

Limitations to keep in mind before relying on it: AuthGraph is a detection-and-alignment layer evaluated on benchmarks, not a shipping product; it depends on being able to derive a faithful clean-context authorization graph and to reconstruct provenance from the trajectory; and the residual attack success is reduced, not zero. It contains and detects manipulation rather than preventing a model from being manipulated in the first place.

Status

The work is a May 26, 2026 preprint (arXiv:2605.26497v1) from UCLA, evaluated on the AgentDojo and AgentDyn injection benchmarks against GPT-4o and compared with CaMeL, DRIFT, and Progent. No CVE is associated, because AuthGraph describes a defense, not a vulnerability. Teams running agents in production can adopt the underlying principle today — an isolated, parameter-source-level authorization baseline compared structurally against execution provenance — independently of this particular prototype.