INDIRECT INJECTION MEDIUM NEW

IterInject: when an LLM optimiser writes its own indirect prompt injections

A May 23, 2026 paper closes the loop between payload, diagnoser and LLM optimiser — lifting indirect-injection ASR from near-zero to 33–90% on InjecAgent and compromising 5 of 9 Claude Code targets.

2026-05-28 // 6 min // affects: agentdojo, injectagent, claude-code, deepseek, tool-using-llm-agents

What is this?

On May 23, 2026, researchers from Shanghai Jiao Tong University and the University of Hong Kong posted “IterInject: Indirect Prompt Injection Against LLM Agents via Feedback-Guided Iterative Optimization” on arXiv. The paper does not claim a new payload pattern. It claims something more uncomfortable: that the search problem at the heart of indirect prompt injection (IPI) can be automated, and that the resulting optimiser pushes attack success rates (ASR) well above what static or hand-tuned payloads have reached on the AgentDojo and InjecAgent benchmarks.

The headline numbers, taken from the paper’s evaluation: on AgentDojo’s 510 attack instances across four task suites, IterInject achieves the highest overall ASR on every one of the four victim models tested, with the largest gap on DeepSeek (47.8% versus 32.9% for the strongest static prompt baseline, a difference of 14.9 percentage points). On InjecAgent, total ASR moves from near-zero with static prompts to 33–90% depending on the victim. The authors also run an extension experiment against Claude Code, a production-grade coding agent with layered defences, and report 5 of 9 targets compromised with optimised payloads.

How it works

IterInject treats payload crafting as a closed loop with three components, all of which are described in the paper at the level of mechanism — not at the level of a copy-paste exploit.

                +-------------------------+
                |   victim LLM agent      |
                |   (runs against the     |
                |    benchmark task)      |
                +-----------+-------------+
                            |
                            v
                +-------------------------+
                |  rule-based diagnoser   |
                |  -> structured outcome  |
                |     label + behavioural |
                |     description         |
                +-----------+-------------+
                            |
                            v
                +-------------------------+
                |  LLM-based optimiser    |
                |  conditioned on the     |
                |  full optimisation log  |
                |  -> next payload        |
                +-----------+-------------+
                            |
                            v
                  (loop, with seed
                   synthesis to expand
                   the strategy space)

Three properties matter for defenders.

First, the diagnoser is structured, not free-text. Each attempt is labelled with a behavioural outcome (tool invoked, parameter exfiltrated, refusal, no-op, partial deviation), which means the optimiser is not flailing on a single binary success signal. This is what lets the loop escape the local optima that hand-crafted IPI prompts tend to fall into.
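To make the idea of structured labels concrete, here is a minimal sketch of a rule-based diagnoser, which is also what an evaluation harness or an audit log would need. The `Trace` fields, the refusal markers, and the rule order are all assumptions for illustration; the paper's actual rules are not reproduced here. Only the five outcome labels come from the description above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Outcome(Enum):
    TOOL_INVOKED = "tool_invoked"
    PARAMETER_EXFILTRATED = "parameter_exfiltrated"
    REFUSAL = "refusal"
    NO_OP = "no_op"
    PARTIAL_DEVIATION = "partial_deviation"


@dataclass
class ToolCall:
    name: str
    arguments: dict


@dataclass
class Trace:
    """Hypothetical record of one agent run in a test environment."""
    tool_calls: List[ToolCall]
    final_text: str
    expected_tools: List[str]           # tools the benign task legitimately needs
    canary_tool: str                    # tool the injected instruction asks for
    canary_value: Optional[str] = None  # marker planted in private test data


# Assumed, deliberately crude refusal heuristics.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't")


def diagnose(trace: Trace) -> Outcome:
    """Label a run; rules are checked from most to least severe."""
    for call in trace.tool_calls:
        if trace.canary_value and any(
            trace.canary_value in str(v) for v in call.arguments.values()
        ):
            return Outcome.PARAMETER_EXFILTRATED
    if any(c.name == trace.canary_tool for c in trace.tool_calls):
        return Outcome.TOOL_INVOKED
    if any(m in trace.final_text.lower() for m in REFUSAL_MARKERS):
        return Outcome.REFUSAL
    if any(c.name not in trace.expected_tools for c in trace.tool_calls):
        return Outcome.PARTIAL_DEVIATION
    return Outcome.NO_OP
```

The same labels are useful on the defensive side: logging them per run shows whether a guardrail produces clean refusals or merely shifts the agent into partial deviation.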

Second, the optimiser is itself an LLM conditioned on the running log. It does not need gradient access to the victim model, which is what makes the attack work on closed-source production agents. This is the same observation that has driven adaptive jailbreaking work over the past eighteen months (see The Attacker Moves Second) — applied here to the IPI surface specifically.

Third, the authors run a mechanistic analysis on top of the empirical results. They report an attention-mediated threshold: payloads succeed when injected content draws enough of the model’s attention away from the original system prompt to cross a per-model boundary. Causal interventions on the relevant attention heads change the outcome. The implication is that “data-instruction separation” defences — the dominant paradigm for IPI guardrails today — are fighting on the wrong axis: they sanitise text without changing how attention is allocated once that text is in the context window.
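As a rough illustration of what an attention threshold means, the sketch below computes the share of attention mass that lands on injected tokens and compares it to a cutoff. This is an assumption-laden toy: the paper's actual metric, the heads it inspects, and the per-model boundary values are not reproduced here, and `injected_attention_share`, `crosses_threshold`, and the threshold numbers are hypothetical.

```python
from typing import Sequence


def injected_attention_share(weights: Sequence[float], injected: Sequence[bool]) -> float:
    """Fraction of total attention mass that falls on injected positions.

    `weights` holds one non-negative attention weight per context token,
    `injected` marks which of those tokens came from untrusted content.
    """
    if len(weights) != len(injected):
        raise ValueError("weights and injected must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    on_injected = sum(w for w, flag in zip(weights, injected) if flag)
    return on_injected / total


def crosses_threshold(weights: Sequence[float], injected: Sequence[bool], threshold: float) -> bool:
    """True when the injected share reaches the (hypothetical) per-model boundary."""
    return injected_attention_share(weights, injected) >= threshold


# Toy context: two system-prompt tokens followed by two injected tokens.
weights = [5.0, 2.0, 2.0, 1.0]
injected = [False, False, True, True]
share = injected_attention_share(weights, injected)  # 3.0 / 10.0
```

Under this framing, shrinking the untrusted blob lowers the numerator directly, which text sanitisation alone does not do.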

Why it matters

The “novel attack” framing is the wrong one. The technique generalises a search procedure that was already in the literature for jailbreaks (PAIR, TAP, automated prompt optimisation) and ports it cleanly to the IPI setting. What is new is the floor that closed-loop optimisation now sets for the attacker.

Three takeaways for teams shipping agents.

The gap between research benchmarks and production agents is small. Five of nine Claude Code targets compromised under an adaptive loop is the same order of magnitude as the lab benchmarks. If you are running a coding agent against untrusted source files, pull requests, or external documentation, the threat model is no longer “a determined human writes the perfect prompt” — it is “an LLM writes a few hundred prompts overnight and keeps the ones that worked”.

The mechanistic finding undercuts entire defence categories. If success is governed by attention allocation rather than by surface-form distinguishability between data and instructions, then prefix-tagging, role-tagging, and input sanitisation defences will keep producing the same disappointing AgentDojo numbers: defences that lower ASR also lower utility, and the IterInject results make that trade-off worse, not better.

Evaluation must assume adaptivity. A guardrail that holds against a fixed corpus of injection strings tells you almost nothing about its behaviour under a feedback-guided optimiser. This is now the third independent line of evidence saying the same thing: Carlini et al. on jailbreaks, Abdelnabi & Bagdasarian on contextual integrity, and IterInject on IPI specifically.
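The difference between the two measurements can be shown with a deliberately abstract toy. In the sketch below, `static_asr` scores a guard against a fixed corpus, while `adaptive_asr` lets a proposer see the log of earlier attempts and try again within a budget. Everything here is a stand-in: the guard is an exact-match blocklist, the proposer only appends a counter, the probe strings are placeholders, and "passing the guard" is treated as success, which a real evaluation would replace with an actual agent run and outcome check.

```python
from typing import Callable, List, Sequence, Tuple

Guard = Callable[[str], bool]                        # True means the input was blocked
Proposer = Callable[[List[Tuple[str, bool]]], str]   # sees the log, returns next candidate


def static_asr(guard: Guard, corpus: Sequence[str]) -> float:
    """Share of a fixed corpus that gets past the guard."""
    passed = sum(1 for item in corpus if not guard(item))
    return passed / len(corpus)


def adaptive_asr(guard: Guard, seeds: Sequence[str], propose: Proposer, budget: int) -> float:
    """Share of seeds for which some candidate passes within `budget` attempts."""
    successes = 0
    for seed in seeds:
        log: List[Tuple[str, bool]] = []
        candidate = seed
        for _ in range(budget):
            blocked = guard(candidate)
            log.append((candidate, blocked))
            if not blocked:
                successes += 1
                break
            candidate = propose(log)
    return successes / len(seeds)


# Placeholder corpus; real test strings would come from your own eval set.
CORPUS = ["probe-a", "probe-b", "probe-c"]


def blocklist_guard(text: str) -> bool:
    """Toy guard that only recognises strings it has seen before."""
    return text in CORPUS


def toy_propose(log: List[Tuple[str, bool]]) -> str:
    """Toy proposer: tag the original seed with the attempt count."""
    seed = log[0][0]
    return f"{seed}#{len(log)}"


static_score = static_asr(blocklist_guard, CORPUS)
adaptive_score = adaptive_asr(blocklist_guard, CORPUS, toy_propose, budget=3)
```

Here the static score is 0.0 and the adaptive score is 1.0 for the same guard. Real guards are far less brittle than an exact-match blocklist, but the point carries: the two numbers answer different questions, and only the second one models an attacker who gets feedback.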

Defenses

The paper itself ends on defence implications. The following steps are concrete and actionable today:

  1. Stop measuring IPI defences with static benchmarks alone. If your guardrail’s eval is a fixed list of attack strings, replicate the IterInject loop (or its public predecessor AgentVigil) against your stack before you ship. ASR numbers from a static eval will be 20–60 percentage points lower than what an adaptive attacker reaches.

  2. Constrain attention surface at the agent layer, not only the prompt layer. Reduce the size of the untrusted blob that lands in context (chunking, summarisation through a clean model, structured-extraction-only ingestion), keep tool definitions and system prompt in a separate role/segment, and limit per-call tool capability to the minimum the current task needs. The goal is to keep injected content from accumulating enough attention mass to cross the threshold the paper identifies.

  3. Detect at the action layer. Tool-call honeypots — fake tools, fake credentials, allowlisted parameters — give a clean compromise signal that does not depend on parsing natural-language intent. See AgentShield (May 10, 2026) for one published instantiation, and our coverage of tool-result parsing defences for complementary techniques.

  4. Assume Claude Code (and equivalents) are part of your attack surface. The paper’s extension result is not a Claude-Code-specific bug; it is a generic IPI optimiser applied against a layered-defence target with non-trivial success. Treat external content read by your coding agent — issues, PRs, dependency READMEs, tool outputs — as untrusted by default, and gate destructive actions behind explicit human confirmation.

  5. Rate-limit and log the search loop. An adaptive IPI optimiser is noisy from the API side: it makes many similar calls in a short window with monotonically improving outcome signals. API-side anomaly detection on prompt similarity, tool-call diversity, and per-session retry patterns is a cheap secondary control even before you touch the agent design.
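Items 2 to 4 in the list above share one enforcement point: the moment the agent asks to call a tool. A minimal sketch of such an action-layer gate follows, assuming a per-task `Policy` with three hypothetical sets (allowed tools, honeypot tools, destructive tools) and a `confirm` callback standing in for a human approval step. None of these names come from the paper or from any specific agent framework.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, FrozenSet


class Verdict(Enum):
    ALLOW = "allow"
    DENY = "deny"
    COMPROMISED = "compromised"


@dataclass(frozen=True)
class Policy:
    allowed_tools: FrozenSet[str]      # minimum set the current task needs
    honeypot_tools: FrozenSet[str]     # decoys that no legitimate task ever calls
    destructive_tools: FrozenSet[str]  # require explicit human confirmation


def gate(tool: str, policy: Policy, confirm: Callable[[str], bool]) -> Verdict:
    """Decide what to do with a requested tool call.

    A honeypot hit is treated as a compromise signal regardless of the
    natural-language reasoning that led to it.
    """
    if tool in policy.honeypot_tools:
        return Verdict.COMPROMISED
    if tool not in policy.allowed_tools:
        return Verdict.DENY
    if tool in policy.destructive_tools and not confirm(tool):
        return Verdict.DENY
    return Verdict.ALLOW


# Example policy for a hypothetical code-review task.
policy = Policy(
    allowed_tools=frozenset({"read_file", "post_comment", "delete_branch"}),
    honeypot_tools=frozenset({"export_all_secrets"}),
    destructive_tools=frozenset({"delete_branch"}),
)


def always_decline(tool: str) -> bool:
    """Stand-in for a human who declines every confirmation prompt."""
    return False
```

With this policy, `gate("read_file", policy, always_decline)` allows the call, `gate("delete_branch", policy, always_decline)` denies it, and `gate("export_all_secrets", policy, always_decline)` flags a compromise. In practice a `COMPROMISED` verdict would also end the session and feed the logging described in item 5.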

Status

| Item                                  | Reference                                     | Date       | Notes                                                           |
|---------------------------------------|-----------------------------------------------|------------|-----------------------------------------------------------------|
| IterInject paper                      | arXiv 2605.24659                              | 2026-05-23 | SJTU + HKU authors                                              |
| AgentDojo benchmark                   | spylab.ai / GitHub                            | live       | 510 attack instances across four suites; used by US/UK AISI     |
| InjecAgent benchmark                  | arXiv 2403.02691                              | 2024-03    | Total ASR on IterInject: 33–90% across four victims             |
| Claude Code extension experiment      | Same paper, §extension                        | 2026-05-23 | 5 of 9 targets compromised                                      |
| Companion adversarial-evaluation work | The Attacker Moves Second, arXiv 2510.09023   | 2025-10    | Same conclusion across 12 defences, jailbreak side              |
| Contextual-Integrity result           | arXiv 2605.17634                              | 2026-05-17 | Theoretical companion: separation is the wrong frame            |

The right framing for builders is not “another IPI paper”. It is “an automated search procedure now sits between your guardrail and an attacker, and the floor of what that search reaches is rising every quarter”. Re-baseline your IPI evals against an adaptive optimiser, or read the next set of benchmark numbers as a description of your own production agent.

Sources