IPI Arena: a 272k-attack competition finds no agent model immune
Gray Swan's Indirect Prompt Injection Arena, judged with UK AISI and US CAISI, ran 272,000+ attacks against 13 frontier models. Every model was hijacked — and a single universal template broke nine of them.
What is this?
In March 2026, Gray Swan AI published results from its Indirect Prompt Injection (IPI) Arena — the largest public agent-hijacking competition run to date — in a paper, “How Vulnerable Are AI Agents to Indirect Prompt Injections? Insights from a Large-Scale Public Competition” (arXiv 2603.15714, March 16, 2026). The competition was designed with the UK AI Security Institute (UK AISI), the US Center for AI Standards and Innovation (US CAISI), and frontier labs including OpenAI, Anthropic and Meta. The US side summarised its takeaways in a CAISI research blog on March 23, 2026.
Over three weeks, 464 participants submitted more than 272,000 attack attempts against 13 frontier models across 41 agentic scenarios, for a $40,000 prize pool. Red teamers landed 8,648 successful attacks. The headline finding is blunt: across every model tested, at least one attack succeeded. No model was immune.
How it works
Indirect prompt injection is not jailbreaking. The user does nothing wrong. The attacker hides instructions inside content the agent later reads — an email, a web page, a document, a code repository — and the agent executes those instructions while returning an ordinary-looking response to the user.
The Arena’s distinguishing design choice was a dual success condition: an attack only counted if the agent (1) performed the harmful action and (2) concealed it from the user. Most prior IPI research measured only the first half. Requiring concealment is what makes the results operationally relevant — the attacks that count are the ones a user would never notice.
The scenarios spanned tool-use agents (email assistants, shopping bots, smart-home controllers), coding agents, and computer-use agents. Two structural findings stand out:
Finding What the data showed
-------------------------- ----------------------------------------------------
A universal template One injection pattern worked across 21 of 41
scenarios and 9 models with minimal modification.
It frames the session as a simulated environment
with a fake "control panel," tricking the model into
treating injected text as system-level commands.
Transfer asymmetry 2,679 winning attacks were re-run against all 13
models. Attacks that broke the *most robust* model
(Claude Opus 4.5) transferred to the others at a
44-81% rate. Attacks that broke weaker models
barely transferred upward.
No payloads are reproduced here. The universal-template description above is the conceptual mechanism already published by Gray Swan; the runnable evaluation kit is open-sourced at GraySwanAI/ipi_arena_os for defenders to test their own systems.
Why it matters
Three results should reshape how you reason about agent risk.
First, attack success rate did not plateau. Models kept getting broken at a roughly constant rate for the full three weeks. More attacker effort always produced more breaks — there is no observed point at which a model becomes “attacked out.” A 0.5% success rate sounds tolerable until you remember a deployed agent may process thousands of untrusted inputs a day; at that scale it is a persistent, exploitable surface.
Second, capability and robustness are only weakly correlated. Gemini 2.5 Pro was among the most capable models tested and also the most vulnerable (8.5% ASR), while Claude Opus 4.5 was the most robust (0.5%). Model family and training recipe predicted robustness far better than benchmark scores. Robustness did improve within a family — Claude Haiku 4.5 (1.3%) → Sonnet 4.5 (1.0%) → Opus 4.5 (0.5%), and Gemini 3 Pro improved markedly over 2.5 Pro — but you cannot read security off a capability leaderboard.
Third, the transfer asymmetry inverts the usual intuition. Cheap tricks that beat weak models do not scale up; exploits that beat the strongest model cascade down to everything else. An attacker who invests in cracking the hardest target likely gets the rest for free.
Defenses
The paper’s own conclusion is that model-level robustness training is necessary but not sufficient — you need system-level and architectural defenses. Concretely:
-
Isolate untrusted input from control flow. Treat any content an agent ingests (emails, web pages, documents, repos, tool output) as data, never as instructions. Architectural patterns that constrain what an agent can do regardless of what it reads — capability scoping, allowlisted actions, human approval on high-impact steps — match the failure mode the Arena documented. This is the same lesson behind the lethal trifecta and the agent rule of two.
-
Don’t pick a model on capability alone. If you are choosing a model for an agentic deployment, weigh published hijacking-robustness data alongside capability. Comparative benchmarks like this one exist precisely so deployers can see the risk profile of each option.
-
Test for concealment, not just success. Your red-team and monitoring should flag the case where an agent takes an action and the user-facing summary omits it. Logging the full action trace independently of the model’s natural-language output is the control that surfaces the attacks that count.
-
Run the open benchmark against your own stack. The evaluation kit (scenarios, judging system, a sample of attacks) lets you test your specific agent configuration and any defenses you bolt on, rather than trusting a vendor’s headline number.
-
Assume universal, transferable attacks. Because one template broke nine models and strong-model exploits transfer downward, defenses tied to a single model’s quirks will not hold. Build defenses at the orchestration layer that survive a model swap.
-
Plan for benchmark refresh. Gray Swan states the benchmark will be updated quarterly with new scenarios and models. Treat agent-security posture as a moving target and re-evaluate on each model upgrade, not once at launch.
Status
| Item | Reference | Date | Notes |
|---|---|---|---|
| IPI Arena paper (arXiv 2603.15714) | arXiv | 2026-03-16 | 13 models, 464 participants, 272k+ attempts, 8,648 successful |
| Gray Swan write-up | Gray Swan AI | 2026-03-18 | ASR 0.5% (Claude Opus 4.5) → 8.5% (Gemini 2.5 Pro) |
| CAISI research blog | NIST | 2026-03-23 | US government summary; full dataset shared with UK AISI & US CAISI |
| Evaluation kit | GitHub (GraySwanAI/ipi_arena_os) | 2026-03 | Open-source scenarios + judge; 95 Qwen-3-VL-235B attacks released |
| Planned cadence | Gray Swan AI | quarterly | Recurring competitions with new scenarios and latest models |
The correct reading is not “AI agents are broken.” It is “indirect prompt injection is an unsolved, structural property of current instruction-following models, it does not plateau under attacker pressure, and the only durable defenses live above the model.” If your architecture assumes the model will resist injected instructions, the Arena data says it won’t.