Automated prompt injection is model-dependent: TAP beats GCG, GPT-5 resists
A June 9, 2026 ETH Zurich study adapts GCG and TAP to AgentDojo across 80 agent task pairs. Black-box TAP beats gradient-based GCG, yet attacks tuned on small models fail to transfer to GPT-5.
What is this?
On June 9, 2026, three ETH Zurich researchers — David Hofer, Edoardo Debenedetti and Florian Tramèr — published Assessing Automated Prompt Injection Attacks in Agentic Environments (arXiv:2606.10525). It is the first systematic measurement of whether the automated attack methods that work for jailbreaking also work for indirect prompt injection (IPI) against tool-calling agents. The short answer: they work, but unevenly. Against small open-weight models the success rates are real; against a frontier model (GPT-5) they collapse, and attacks optimized on small models do not transfer up. Automated injection is a credible threat — but a strongly model-dependent one.
How it works
The team adapted two known jailbreak optimizers to the agentic setting inside AgentDojo, the standard benchmark for agents acting on untrusted data. The white-box method is GCG, which uses gradients to search for an adversarial token string; the black-box method is TAP, which uses an attacker LLM to iteratively rewrite an injection and prune dead ends. No payloads are reproduced here — the contribution is the measurement, not a recipe.
The evaluation spans 80 task pairs across four domains (workspace, banking, travel, slack). The headline numbers, on the small Qwen3-4B target:
| Method (Qwen3-4B target) | Attack Success Rate |
|---|---|
| Universal TAP (black-box) | 45.2% |
| Single-task TAP | 44.6% |
| Universal GCG (white-box) | 24.1% |
| Single-task GCG | 23.0% |
Two structural findings stand out. First, black-box beats white-box: TAP roughly doubles GCG’s success, which the authors attribute to GCG’s optimization instability under a realistic compute budget. Second, the attack’s strength depends on the attacker model — a stronger, less safety-tuned attacker LLM produces better injections, while a safety-tuned attacker sometimes refuses to generate them at all.
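As a quick arithmetic check on the "roughly doubles" claim, the ratios can be computed directly from the four attack success rates (ASR) reported above. This is only a sketch: the dictionary keys and the function name `tap_over_gcg` are illustrative labels, not identifiers from the paper or from AgentDojo.

```python
# Attack success rates (%) on the Qwen3-4B target, copied from the figures above.
# The (setting, method) keys are illustrative labels, not names from the paper.
asr = {
    ("universal", "TAP"): 45.2,
    ("single-task", "TAP"): 44.6,
    ("universal", "GCG"): 24.1,
    ("single-task", "GCG"): 23.0,
}


def tap_over_gcg(setting: str) -> float:
    """Ratio of TAP's success rate to GCG's for one setting."""
    return asr[(setting, "TAP")] / asr[(setting, "GCG")]


for setting in ("universal", "single-task"):
    print(f"{setting}: TAP/GCG = {tap_over_gcg(setting):.2f}")
# universal: TAP/GCG = 1.88
# single-task: TAP/GCG = 1.94
```

Both ratios sit a little under 2, so "roughly doubles" is a fair summary in either setting.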
Why it matters
The interesting result is the ceiling, not the floor. On GPT-5, the best attacks reach only about 4.5–4.7% ASR, and GCG strings transferred from Qwen3-4B land below 1%. Universal injections that generalize to held-out task domains on the small model drop to 0% on GPT-5’s held-out domain. In other words, the cheap path — optimize an injection against an open model you control, then fire it at a frontier deployment — largely does not work today.
That is good news with an expiry date. It says model-agnostic, push-button injection is not here yet; it does not say agents are safe. Slack-style tasks were the most vulnerable surface (around 67% ASR on the small model), and even a plain instruction with no optimization scored ~25% there. Anyone running open-weight or smaller models in an agent loop over untrusted content is squarely in the exploitable range the paper measures.
Defenses
The paper’s own finding — frontier robustness plus poor cross-model transfer — is a reason to be deliberate about model choice for agents that read untrusted data, not a reason to relax. The durable mitigations are architectural and predate this work:
- Treat tool output as data, never as instructions. Keep retrieved content out of the privileged instruction channel; AgentDojo exists precisely to test defenses built on this separation.
- Authorize the action, not the text. Gate every consequential tool call (send, pay, share, delete) on the user’s original intent, with human confirmation for irreversible operations.
- Constrain the blast radius. Least-privilege tool scopes, allow-listed recipients and per-session spend/scope limits turn a successful injection into a contained one.
- Watch the high-risk surfaces first. Messaging and email tools showed the highest susceptibility — prioritize monitoring and guardrails there.
- Re-test under optimization, not just static prompts. A defense that survives a hand-written injection can still fall to an adaptive, attacker-LLM-driven one; evaluate with automated red-teaming.
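The "authorize the action" and "constrain the blast radius" items above can be sketched as a small gate that sits between the agent and its tools. This is a minimal illustration under stated assumptions, not AgentDojo's API or a defense evaluated in the paper: `ToolCall`, `Policy`, `authorize`, the tool names and the `recipient`/`amount` argument keys are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical tool names; a real deployment would list its own irreversible actions.
IRREVERSIBLE = {"send_money", "send_message", "share_file", "delete_file"}


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class Policy:
    allowed_tools: set         # tools the user's original request actually needs
    allowed_recipients: set    # allow-listed recipients for outbound actions
    spend_limit: float         # per-session spending cap
    spent: float = 0.0         # running total for this session


def authorize(call: ToolCall, policy: Policy,
              confirm: Callable[[ToolCall], bool]) -> bool:
    """Decide on the structured call alone; injected text never reaches this check."""
    # Least privilege: the tool must be in scope for the user's task.
    if call.name not in policy.allowed_tools:
        return False
    # Allow-list: outbound actions may only target known recipients.
    recipient = call.args.get("recipient")
    if recipient is not None and recipient not in policy.allowed_recipients:
        return False
    # Spend cap: reject anything that would exceed the session budget.
    amount = float(call.args.get("amount", 0))
    if policy.spent + amount > policy.spend_limit:
        return False
    # Human confirmation for irreversible operations.
    if call.name in IRREVERSIBLE and not confirm(call):
        return False
    policy.spent += amount
    return True


policy = Policy(allowed_tools={"read_inbox", "send_money"},
                allowed_recipients={"alice@example.com"},
                spend_limit=100.0)
approve = lambda call: True  # stand-in for a real user confirmation prompt
pay_alice = ToolCall("send_money", {"recipient": "alice@example.com", "amount": 60})
pay_other = ToolCall("send_money", {"recipient": "mallory@example.com", "amount": 10})

print(authorize(pay_other, policy, approve))  # False: recipient not allow-listed
print(authorize(pay_alice, policy, approve))  # True: in scope, under the cap, confirmed
print(authorize(pay_alice, policy, approve))  # False: a second 60 would exceed the cap
```

The design choice worth noting is that the gate inspects only the structured call and the session policy. Whatever an injection wrote into a retrieved document or message has no input to the decision, so even a successful injection is limited to actions the user's task already permitted.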
Status
| Item | Detail |
|---|---|
| Publication | arXiv:2606.10525 v1, 9 June 2026 |
| Authors | Hofer, Debenedetti, Tramèr (ETH Zurich) |
| Framework | AgentDojo (extended for white-box access) |
| Most robust model tested | GPT-5 (best attacks ~4.5–4.7% ASR; transferred GCG <1%) |
| Most vulnerable surface | Slack-style messaging tasks (~67% ASR on Qwen3-4B) |
| Nature | Defensive measurement study — no exploit released |