system: OPERATIONAL
← back to all hacks
INDIRECT INJECTION MEDIUM NEW

Injection depth in ReAct agents: position beats wording

A June 2026 study of tool-calling ReAct agents finds injection depth—not rhetoric—drives indirect prompt injection: success falls from 60% at the first tool call to 0% by the fourth.

2026-06-15 // 6 min affects: react-agents, tool-calling-agents, llm-agents

What is this?

Most indirect prompt injection research asks what a malicious payload should say — which phrasing, which authority cues, which obfuscation slips past a model’s defenses. A new arXiv paper, “Depth-Dependent Indirect Prompt Injection in Tool-Calling ReAct Agents” (arXiv:2605.30686, June 2026), asks a different question: where in an agent’s run does the payload land, and does that matter more than the wording. The answer is that position dominates. An injection placed in the first tool result the agent reads is far more likely to succeed than the same text placed a few steps later.

The study targets ReAct agents — the now-standard loop that interleaves chain-of-thought reasoning with tool calls, used for scheduling, file retrieval, web browsing, and data access. Any tool whose return value an attacker controls (a web page, a document, an email, an API response) is a channel for indirect prompt injection. What the paper adds is a structured measurement of three variables — injection depth, payload framing, and turn budget — instead of treating injection as a single yes/no event.

How it works

A ReAct agent runs a loop: think, call a tool, read the tool’s output (the observation), think again, call another tool, and so on until it answers or hits a turn cap. Indirect prompt injection hides instructions inside one of those observations so the agent treats attacker text as if it were part of its own task.

The authors vary three dimensions and measure attack success rate (ASR):

# Conceptual study design — measurement, not an exploit recipe.
injection depth  : which tool observation in the sequence carries the payload (1st, 2nd, ... 5th)
payload framing  : the rhetorical register (e.g. plain instruction vs. "helpful next step")
turn budget      : how many tool-calling turns the agent is allowed before it must answer

The headline result is that injection depth is the dominant variable. ASR falls monotonically with depth: roughly 60% at depth 1 (the first observation the agent reads) down to 0% at depths 4 and 5. Put plainly, an injection that the agent encounters early — while its plan is still forming — steers it; the same injection encountered late, once the agent is committed to a trajectory and close to answering, is largely ignored.

Two consequences follow. First, the paper reports that sanitising only the first tool observation captures about 67% of measured injection successes — a small slice of the context window accounts for most of the risk. Second, the effective design lever for an attacker is structural, not rhetorical: success comes less from clever wording and more from positioning the instruction in a tool output where the requested action reads as a plausible next step. This echoes the framing in “Design Patterns for Securing LLM Agents against Prompt Injections” — that where untrusted data enters the control flow matters more than how it is phrased — and builds on the InjecAgent benchmark that first formalised tool-integrated IPI.

Why it matters

The depth effect reframes where defensive budget should go. Teams often apply uniform input sanitisation to every tool result, or none at all. This measurement says the first one or two observations in an agent’s run are disproportionately dangerous, because that is the window in which the agent’s plan is most malleable. It also explains why some injections that “work” in a single-shot test fail in a longer agentic trace, and vice versa — the same payload has a different blast radius depending on when the agent meets it.

It is worth stating the limits. These are ASR figures from one study’s harness, on a set of models and tasks the authors chose; depth-0 dominance is a tendency, not a guarantee, and a determined attacker who controls the first retrieved source still has a wide opening. The result is a prioritisation signal, not a safe-by-default rule. Treating “sanitise the first observation” as sufficient on its own would be exactly the wrong lesson.

Defenses

The practical takeaway is to weight scrutiny by depth rather than spreading it evenly.

Apply the strongest provenance and sanitisation checks to the earliest tool observations, where the paper shows the agent is most steerable, while still screening later ones. Mark every tool return as untrusted data, never instructions — the instruction-hierarchy principle — so that position becomes a tuning knob on top of a sound trust model, not a replacement for it.

Pair this with trajectory-level defenses that do not depend on catching the payload at ingestion. Inference-time correction schemes such as ICON (arXiv:2602.20708, February 2026) detect and repair a compromised trajectory mid-run while preserving task continuity, which covers the case where a late or well-positioned injection slips past input filtering. Verifying tool calls before they commit — the verify-before-commit pattern — catches an injected action regardless of which observation introduced it.

Finally, constrain blast radius architecturally. Keeping agents within the Agents Rule of Two — limiting how many of (untrusted input, private data, external action) a single agent combines — means that even a depth-1 injection that does steer the agent has less it can reach. Depth-weighted filtering reduces how often an agent is hijacked; capability limits bound how bad each hijack can be.

Status

DimensionFindingSourceDate
Injection depthASR ~60% at depth 1, monotonic to 0% at depth 4–5arXiv:2605.30686Jun 2026
First-observation sanitisationCaptures ~67% of injection successesarXiv:2605.30686Jun 2026
Effective attacker leverStructural (position) over rhetorical (framing)arXiv:2605.30686Jun 2026
Trajectory repair defenseICON inference-time correctionarXiv:2602.20708Feb 2026

This is published measurement research with a defensive reading, not an unpatched product vulnerability. The contribution is prioritisation: in a ReAct loop, the first thing the agent reads from the outside world deserves the most suspicion.

Sources