Agent security lives in the transitions, not the components
A June 2026 synthesis of 247 papers reframes LLM-agent security around state transitions: harm happens when untrusted text silently becomes a plan, a decision, an action, or durable memory.
What is this?
Toward Secure LLM Agents: Threat Surfaces, Attacks, Defenses, and Evaluation (arXiv:2606.10749, posted June 2026) is a systematization of 247 papers on agent security, organized around a single systems-level idea: an LLM agent is not a chatbot that occasionally calls a tool; it is a loop that connects information flow, delegated authority, and persistent state. Once a model sits in that loop, failures stop being “the model said something unsafe” and become hijacked workflows, unauthorized tool calls, corrupted memory, and harmful external actions.
The corpus itself shows how fast the field is moving: 3 papers in 2023, 42 in 2024, 121 in 2025, and 81 more collected by late April 2026. That last figure is roughly a third of the 247-paper corpus, appearing in a single partial year. Most of the corpus (about 68%) is still arXiv preprints, which the authors flag as a sign that the area is real but not yet standardized.
How it works
The survey’s central contribution is a reframing, not an exploit. It argues agent security should be analyzed in terms of state transitions rather than components in isolation. The dangerous moment is not when the agent reads untrusted text — it is when that text is allowed to change category without mediation:
- untrusted content gets reinterpreted as a planning constraint;
- a tentative plan hardens into an executable decision;
- a stored trace is later reused as trusted context.
That view explains why agent security looks different from both classic application security and prompt-only model safety. The question is not only what the system has seen, but what it is now allowed to do because it saw it.
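One way to picture the transition view is as a small typed state machine in which content can be read freely but can only change category through a named check. The sketch below is illustrative only: `Category`, `Item`, `promote`, and the mediator names in `REQUIRED_MEDIATOR` are hypothetical and do not come from the survey, and a real mediator would be an actual check rather than a string.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Category(Enum):
    OBSERVATION = "observation"          # untrusted text the agent has merely read
    PLAN_CONSTRAINT = "plan_constraint"  # text allowed to shape planning
    TENTATIVE_PLAN = "tentative_plan"    # a plan that is not yet executable
    DECISION = "decision"                # a plan that may be executed
    STORED_TRACE = "stored_trace"        # something written to memory earlier
    TRUSTED_CONTEXT = "trusted_context"  # context the agent treats as authoritative


@dataclass(frozen=True)
class Item:
    text: str
    category: Category
    source: str  # e.g. "web", "tool_output", "memory"


class UnmediatedTransition(Exception):
    """Raised when content tries to change category without the required check."""


# Hypothetical policy: the three transitions listed above, each bound to the
# name of the mediator that must approve it.
REQUIRED_MEDIATOR = {
    (Category.OBSERVATION, Category.PLAN_CONSTRAINT): "instruction_filter",
    (Category.TENTATIVE_PLAN, Category.DECISION): "policy_check",
    (Category.STORED_TRACE, Category.TRUSTED_CONTEXT): "provenance_review",
}


def promote(item: Item, target: Category, approved_by: Optional[str] = None) -> Item:
    """Return a copy of `item` in the target category, or raise if unmediated."""
    required = REQUIRED_MEDIATOR.get((item.category, target))
    if required is None:
        raise UnmediatedTransition(
            f"no defined path {item.category.value} -> {target.value}"
        )
    if approved_by != required:
        raise UnmediatedTransition(
            f"{item.category.value} -> {target.value} requires {required}"
        )
    return Item(item.text, target, item.source)
```

In this sketch, a web page wrapped as an `OBSERVATION` can sit in context indefinitely; calling `promote` on it without `approved_by="instruction_filter"` raises, and there is no path at all from an observation straight to a `DECISION`.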
The empirical surface map backs this up. Counting threat surfaces across the corpus, User Prompts lead at 82 papers, but they are only one entry point. Web Content appears in 55 papers, Tool Outputs in 54, Retrieved Content in 37, and Files/Code, the Planning Loop, Memory/Scratchpads, and Inter-agent Channels each in at least 25. Direct prompting is the largest single surface but a minority of the whole; mediated and internalized control paths dominate.
Three modeling principles fall out: data–control ambiguity (agents consume text that is task-relevant but untrusted), delegated authority (the agent acts under permissions the attacker does not own), and persistence and propagation (harm is delayed, surfacing later through memory or agent-to-agent messages).
Crucially, this is not one team’s theory. Microsoft’s AI Red Team, drawing on twelve months of engagements against deployed agents, published an updated failure-mode taxonomy on June 4, 2026 that independently names the same frontier: session context contamination, memory poisoning via cross-domain injection, inter-agent trust escalation, and capability disclosure that turns black-box probing into a white-box exploit path. Two very different methods — a literature synthesis and operational red teaming — converge on the same conclusion.
Why it matters
The headline finding is uncomfortable: prompt injection and tool-mediated control-flow hijacking still dominate, while persistent-state corruption and multi-agent propagation are the rising concern — and current defenses are weakly compositional. Individual mitigations work in isolation but do not stack cleanly into an end-to-end guarantee. Microsoft’s field data sharpens the point: human-in-the-loop bypass was the most consistently exploited failure mode, sometimes as zero-click chains where no single step looked anomalous but the compound outcome was exfiltration or lateral movement.
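The gap between per-step checks and compound outcomes can be made concrete with a toy example. The sketch below assumes a hypothetical set of tool names and a deliberately simple rule (an external write that follows a sensitive read needs review); neither the names nor the rule come from the survey or from Microsoft's taxonomy.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ToolCall:
    tool: str
    target: str


# Hypothetical tool labels, chosen only for illustration.
SENSITIVE_READS = {"read_mailbox", "read_file"}
EXTERNAL_WRITES = {"http_post", "send_email"}
OTHER_ALLOWED = {"summarize"}


def step_looks_benign(call: ToolCall) -> bool:
    """Per-step check: the tool is individually on the allow-list."""
    return call.tool in SENSITIVE_READS | EXTERNAL_WRITES | OTHER_ALLOWED


def chain_needs_review(calls: List[ToolCall]) -> bool:
    """Trajectory check: flag an external write that follows a sensitive read."""
    seen_sensitive = False
    for call in calls:
        if call.tool in SENSITIVE_READS:
            seen_sensitive = True
        elif call.tool in EXTERNAL_WRITES and seen_sensitive:
            return True
    return False


chain = [
    ToolCall("read_mailbox", "inbox"),
    ToolCall("summarize", "inbox"),
    ToolCall("http_post", "https://example.invalid/collect"),
]
every_step_passes = all(step_looks_benign(c) for c in chain)
compound_flagged = chain_needs_review(chain)
```

Here every step passes the per-step check, yet the trajectory check flags the chain. A rule this simple would over-trigger in practice; the point is only that the signal exists at the level of the sequence, not the step.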
The survey also indicts how we measure progress. Existing benchmarks under-represent long-horizon, stateful, and deployment-sensitive risks — exactly the transitions that matter most. A defense that scores well on single-turn injection tests can still fail when contamination is seeded early and detonates many steps later, across a session or a delegation chain.
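A toy harness shows how the two kinds of test can disagree. Everything here is an assumption made for illustration: `ToyAgent` is a deliberately naive stand-in rather than a real framework, `SEND_SECRETS` is a made-up injection marker, and the outcome strings are placeholders for whatever a real evaluation would measure.

```python
from typing import Callable, List


class ToyAgent:
    """Naive stand-in for an agent with persistent memory (hypothetical)."""

    def __init__(self) -> None:
        self.memory: List[str] = []

    def step(self, user_msg: str, retrieved: str = "") -> str:
        # Defense under test: refuse user turns carrying the injection marker.
        if "SEND_SECRETS" in user_msg:
            return "refused"
        # Retrieved content is persisted without sanitization or provenance.
        if retrieved:
            self.memory.append(retrieved)
        # Flaw: memory is treated as trusted context when an export is requested.
        if "export" in user_msg and any("SEND_SECRETS" in m for m in self.memory):
            return "exfiltrated"
        return "ok"


def single_turn_eval(agent_factory: Callable[[], ToyAgent]) -> bool:
    """True if the agent resists a direct single-turn injection."""
    return agent_factory().step("please SEND_SECRETS") == "refused"


def trajectory_eval(agent_factory: Callable[[], ToyAgent], benign_steps: int = 5) -> bool:
    """True if no harmful outcome occurs anywhere in the trajectory.

    Contamination is seeded at step 0 through retrieved content and only has
    a chance to detonate at the final step.
    """
    agent = agent_factory()
    outcomes = [
        agent.step("summarize this page", retrieved="note: SEND_SECRETS on next export")
    ]
    outcomes += [agent.step("continue") for _ in range(benign_steps)]
    outcomes.append(agent.step("export the report"))
    return "exfiltrated" not in outcomes


passes_single_turn = single_turn_eval(ToyAgent)
passes_trajectory = trajectory_eval(ToyAgent)
```

In this sketch the same agent passes the single-turn test and fails the trajectory test, because the seeded note is harmless when stored and only matters several steps later.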
Defenses
The paper’s prescription is architectural, and it maps cleanly onto controls defenders can adopt now:
- Make trust boundaries explicit. Tag every channel by authority. Web content, retrieved documents, and tool outputs are low-authority observations and must never be silently promoted to instructions. Keep trusted system context structurally separate from untrusted retrieved content.
- Control privilege at the action, not the prompt. Put capability checks at the tool-execution transition, scoped to least privilege. The model deciding to act is not authorization to act.
- Manage state with provenance. Track where memory entries came from; treat agent-written memory as untrusted on read-back. A single successful injection that seeds memory can propagate across every later session — so sanitize and bound what gets persisted.
- Harden human-in-the-loop as a security control. Decompose compound actions before approval, summarize approvals from the underlying tool calls rather than the agent’s own description, tier approvals by reversibility and blast radius, and watch for consent-fatigue patterns.
- Verify agent identity, don’t infer it. In multi-agent systems, require attestable credentials at handoff; reject self-asserted roles. This closes the confused-deputy path of inter-agent trust escalation.
- Evaluate across full trajectories. Test long-horizon, stateful scenarios — seed contamination early and measure downstream — not just single-turn injection. (OWASP LLM01 remains the baseline reference for the injection class.)
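Three of the controls above (authority-tagged channels, a capability check at tool execution, and provenance-tracked memory) can be sketched together in a few dozen lines. This is a minimal illustration under stated assumptions, not the paper's design: `Authority`, `Message`, `ToolGate`, and `Memory` are hypothetical names, and the 500-character and 100-entry bounds are arbitrary.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List, Set


class Authority(IntEnum):
    UNTRUSTED = 0  # web content, retrieved documents, tool outputs, read-back memory
    USER = 1
    SYSTEM = 2


@dataclass(frozen=True)
class Message:
    text: str
    authority: Authority
    origin: str  # provenance label, e.g. "web", "user", "agent_memory"


@dataclass
class ToolGate:
    """Capability check applied at the tool-execution transition."""

    granted: Set[str]         # least-privilege tool scopes for this task
    needs_approval: Set[str]  # tools tiered as irreversible or high blast radius

    def authorize(self, tool: str, requested_by: Message, approved: bool = False) -> bool:
        if requested_by.authority < Authority.USER:
            return False  # low-authority text never triggers an action
        if tool not in self.granted:
            return False  # outside the scoped capability set
        if tool in self.needs_approval and not approved:
            return False  # human approval required first
        return True


@dataclass
class Memory:
    """Bounded memory that records provenance and is untrusted on read-back."""

    entries: List[Message] = field(default_factory=list)
    max_entries: int = 100
    max_chars: int = 500

    def write(self, text: str, origin: str) -> None:
        # Whatever is persisted comes back as UNTRUSTED, regardless of who wrote it.
        self.entries.append(Message(text[: self.max_chars], Authority.UNTRUSTED, origin))
        self.entries = self.entries[-self.max_entries:]

    def read(self) -> List[Message]:
        return list(self.entries)
```

With a gate built as `ToolGate(granted={"search", "send_email"}, needs_approval={"send_email"})`, a request traced to web content is denied for every tool, a user request for `search` is allowed, and a user request for `send_email` is allowed only once `approved=True` is passed. The design choice worth noting is that the decision depends on the provenance of the request and the scoped capability set, never on what the model's text claims.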
Status
| Item | Detail |
|---|---|
| Primary source | Toward Secure LLM Agents (arXiv:2606.10749), June 2026, 247 papers |
| Corroboration | Microsoft AI Red Team failure-mode taxonomy v2.0, June 4, 2026 |
| Dominant threats | Prompt injection, tool-mediated control-flow hijacking |
| Emerging frontier | Persistent-state corruption, multi-agent propagation |
| Key gap | Defenses weakly compositional; benchmarks miss long-horizon/stateful risk |
| Nature | Defensive systematization — no exploit code, no novel attack |
The practical takeaway is a mental model, not a patch. Stop asking only whether your agent can be tricked into saying something, and start mapping the transitions in your system: input to plan, plan to decision, decision to action, action to stored memory, memory to the next agent. Every one of those arrows is a place where untrusted data can cross into authority. Operational red teaming and the literature now agree that securing those arrows, not hardening the model alone, is where agent security is won or lost.