INDIRECT INJECTION MEDIUM NEW

IPI Arena: a 272k-attack competition finds no agent model immune

Gray Swan's Indirect Prompt Injection Arena, judged with UK AISI and US CAISI, ran 272,000+ attacks against 13 frontier models. Every model was hijacked — and a single universal template broke nine of them.

2026-06-02 // 7 min // affects: claude-opus-4.5, claude-sonnet-4.5, claude-haiku-4.5, gemini-2.5-pro, gemini-3-pro, qwen-3-vl-235b

What is this?

In March 2026, Gray Swan AI published results from its Indirect Prompt Injection (IPI) Arena — the largest public agent-hijacking competition run to date — in a paper, “How Vulnerable Are AI Agents to Indirect Prompt Injections? Insights from a Large-Scale Public Competition” (arXiv 2603.15714, March 16, 2026). The competition was designed with the UK AI Security Institute (UK AISI), the US Center for AI Standards and Innovation (US CAISI), and frontier labs including OpenAI, Anthropic and Meta. The US side summarised its takeaways in a CAISI research blog on March 23, 2026.

Over three weeks, 464 participants submitted more than 272,000 attack attempts against 13 frontier models across 41 agentic scenarios, for a $40,000 prize pool. Red teamers landed 8,648 successful attacks. The headline finding is blunt: across every model tested, at least one attack succeeded. No model was immune.

How it works

Indirect prompt injection is not jailbreaking. The user does nothing wrong. The attacker hides instructions inside content the agent later reads — an email, a web page, a document, a code repository — and the agent executes those instructions while returning an ordinary-looking response to the user.

The Arena’s distinguishing design choice was a dual success condition: an attack only counted if the agent (1) performed the harmful action and (2) concealed it from the user. Most prior IPI research measured only the first half. Requiring concealment is what makes the results operationally relevant — the attacks that count are the ones a user would never notice.
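
The dual success condition can be sketched as a simple predicate over an episode record. This is a minimal illustration only: the `Episode` type, its field names, and `attack_counts` are hypothetical, and the Arena's actual judging system is not described in this article and is presumably more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """One agent run, as a hypothetical judge might summarise it."""
    harmful_action_performed: bool  # did the agent carry out the attacker's goal?
    disclosed_to_user: bool         # did the user-facing reply reveal that action?


def attack_counts(ep: Episode) -> bool:
    """Dual success condition: the action happened AND the user was not told."""
    return ep.harmful_action_performed and not ep.disclosed_to_user


episodes = [
    Episode(harmful_action_performed=True, disclosed_to_user=False),   # counts
    Episode(harmful_action_performed=True, disclosed_to_user=True),    # disclosed
    Episode(harmful_action_performed=False, disclosed_to_user=False),  # not hijacked
]
results = [attack_counts(ep) for ep in episodes]
print(results)
```

Only the first episode satisfies both halves. A judge that measured the action alone would also count the second one, which is the case most prior work did not separate out.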

The scenarios spanned tool-use agents (email assistants, shopping bots, smart-home controllers), coding agents, and computer-use agents. Two structural findings stand out:

Finding                     What the data showed
--------------------------  ----------------------------------------------------
A universal template        One injection pattern worked across 21 of 41
                            scenarios and 9 models with minimal modification.
                            It frames the session as a simulated environment
                            with a fake "control panel," tricking the model into
                            treating injected text as system-level commands.

Transfer asymmetry          2,679 winning attacks were re-run against all 13
                            models. Attacks that broke the *most robust* model
                            (Claude Opus 4.5) transferred to the others at a
                            44-81% rate. Attacks that broke weaker models
                            barely transferred upward.
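
A transfer rate of this kind is a conditional fraction: of the attacks that originally broke a source model, how many also succeed when re-run against a target model. The sketch below shows that bookkeeping. The records are toy values invented purely for illustration, not Arena data, and `transfer_rates` is a hypothetical helper name.

```python
from collections import defaultdict

# Toy re-run records, invented for illustration only (NOT Arena data):
# (model the attack originally broke, model it was re-run against, succeeded?)
reruns = [
    ("strong", "weak", True),
    ("strong", "weak", True),
    ("strong", "weak", False),
    ("weak", "strong", False),
    ("weak", "strong", False),
    ("weak", "strong", True),
]


def transfer_rates(records):
    """Map each (source, target) pair to the fraction of re-runs that succeeded."""
    wins = defaultdict(int)
    totals = defaultdict(int)
    for source, target, succeeded in records:
        totals[(source, target)] += 1
        wins[(source, target)] += int(succeeded)
    return {pair: wins[pair] / totals[pair] for pair in totals}


rates = transfer_rates(reruns)
for (source, target), rate in sorted(rates.items()):
    print(f"{source} -> {target}: {rate:.2f}")
```

An asymmetry shows up when the rate for one direction of a pair is much higher than for the reverse direction.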

No payloads are reproduced here. The universal-template description above is the conceptual mechanism already published by Gray Swan; the runnable evaluation kit is open-sourced at GraySwanAI/ipi_arena_os for defenders to test their own systems.

Why it matters

Three results should reshape how you reason about agent risk.

First, attack success rate (ASR) did not plateau. Models kept getting broken at a roughly constant rate for the full three weeks. Within the competition window, more attacker effort kept producing more breaks; there was no observed point at which a model became “attacked out.” A 0.5% success rate sounds tolerable until you remember that a deployed agent may process thousands of untrusted inputs a day. At that volume, even a small per-attempt success probability is a persistent, exploitable surface.
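
A back-of-envelope calculation shows how a small rate compounds. It assumes each attempt succeeds independently with the same probability p, which is a simplification: the 0.5% figure is an ASR measured under competition conditions with motivated red teamers, not a per-input rate for ordinary traffic. The function names are illustrative.

```python
import math


def p_at_least_one(p: float, n: int) -> float:
    """Probability that at least one of n independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** n


def attempts_for(p: float, target: float) -> int:
    """Smallest n for which p_at_least_one(p, n) reaches the target probability."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


p = 0.005  # 0.5% ASR, the lowest rate reported for any model in the Arena
for n in (100, 1000):
    print(n, round(p_at_least_one(p, n), 3))
print(attempts_for(p, 0.5))
```

Under these assumptions the chance of at least one success is roughly 39% after 100 attempts and above 99% after 1,000, and even odds are reached at 139 attempts.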

Second, capability and robustness are only weakly correlated. Gemini 2.5 Pro was among the most capable models tested and also the most vulnerable (8.5% ASR), while Claude Opus 4.5 was the most robust (0.5%). Model family and training recipe predicted robustness far better than benchmark scores. Robustness did improve within a family — Claude Haiku 4.5 (1.3%) → Sonnet 4.5 (1.0%) → Opus 4.5 (0.5%), and Gemini 3 Pro improved markedly over 2.5 Pro — but you cannot read security off a capability leaderboard.

Third, the transfer asymmetry inverts the usual intuition. Cheap tricks that beat weak models do not scale up; exploits that beat the strongest model cascade down to the others at a 44-81% rate. An attacker who invests in cracking the hardest target gets much of the rest at little extra cost.

Defenses

The paper’s own conclusion is that model-level robustness training is necessary but not sufficient — you need system-level and architectural defenses. Concretely:

  1. Isolate untrusted input from control flow. Treat any content an agent ingests (emails, web pages, documents, repos, tool output) as data, never as instructions. Architectural patterns that constrain what an agent can do regardless of what it reads — capability scoping, allowlisted actions, human approval on high-impact steps — match the failure mode the Arena documented. This is the same lesson behind the lethal trifecta and the agent rule of two.

  2. Don’t pick a model on capability alone. If you are choosing a model for an agentic deployment, weigh published hijacking-robustness data alongside capability. Comparative benchmarks like this one exist precisely so deployers can see the risk profile of each option.

  3. Test for concealment, not just success. Your red-team and monitoring should flag the case where an agent takes an action and the user-facing summary omits it. Logging the full action trace independently of the model’s natural-language output is the control that surfaces the attacks that count.

  4. Run the open benchmark against your own stack. The evaluation kit (scenarios, judging system, a sample of attacks) lets you test your specific agent configuration and any defenses you bolt on, rather than trusting a vendor’s headline number.

  5. Assume universal, transferable attacks. Because one template broke nine models and strong-model exploits transfer downward, defenses tied to a single model’s quirks will not hold. Build defenses at the orchestration layer that survive a model swap.

  6. Plan for benchmark refresh. Gray Swan states the benchmark will be updated quarterly with new scenarios and models. Treat agent-security posture as a moving target and re-evaluate on each model upgrade, not once at launch.
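
Items 1 and 3 above can be combined in a small orchestration-layer gate. The sketch below is one possible shape under stated assumptions, not a method from the paper: the action names, `ActionGate`, and `unreported_actions` are all hypothetical, and the set of actions the summary reports is assumed to be extracted by some separate step.

```python
from dataclasses import dataclass, field
from typing import Callable

ALLOWED = {"read_email", "draft_reply", "send_email"}  # hypothetical action names
NEEDS_APPROVAL = {"send_email"}                         # high-impact subset


@dataclass
class ActionGate:
    approve: Callable[[str, dict], bool]       # human-approval callback
    trace: list = field(default_factory=list)  # audit log kept outside the model

    def request(self, action: str, args: dict) -> bool:
        """Decide whether a model-proposed action may run; always log the decision."""
        if action not in ALLOWED:
            decision = "blocked"
        elif action in NEEDS_APPROVAL and not self.approve(action, args):
            decision = "denied"
        else:
            decision = "executed"
        self.trace.append((action, decision))
        return decision == "executed"


def unreported_actions(trace, reported):
    """Executed actions that the user-facing summary never mentioned."""
    return [a for a, decision in trace if decision == "executed" and a not in reported]


gate = ActionGate(approve=lambda action, args: False)  # reviewer rejects everything
gate.request("read_email", {})
gate.request("send_email", {"to": "someone@example.com"})
gate.request("transfer_funds", {"amount": 100})
gate.request("draft_reply", {})
flagged = unreported_actions(gate.trace, reported={"read_email"})
print(gate.trace)
print(flagged)
```

The gate decides from the action name alone, so injected text cannot widen what the agent is allowed to do. The trace is written by the orchestrator rather than the model, which is what lets the final check flag `draft_reply` as executed but missing from the summary.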

Status

Item                                Reference                         Date        Notes
----------------------------------  --------------------------------  ----------  ----------------------------------------
IPI Arena paper (arXiv 2603.15714)  arXiv                             2026-03-16  13 models, 464 participants, 272k+
                                                                                  attempts, 8,648 successful

Gray Swan write-up                  Gray Swan AI                      2026-03-18  ASR 0.5% (Claude Opus 4.5) → 8.5%
                                                                                  (Gemini 2.5 Pro)

CAISI research blog                 NIST                              2026-03-23  US government summary; full dataset
                                                                                  shared with UK AISI & US CAISI

Evaluation kit                      GitHub (GraySwanAI/ipi_arena_os)  2026-03     Open-source scenarios + judge; 95
                                                                                  Qwen-3-VL-235B attacks released

Planned cadence                     Gray Swan AI                      quarterly   Recurring competitions with new
                                                                                  scenarios and latest models

The correct reading is not “AI agents are broken.” It is “indirect prompt injection is an unsolved, structural property of current instruction-following models, it does not plateau under attacker pressure, and durable defenses have to include layers above the model.” If your architecture assumes the model alone will resist injected instructions, the Arena data says that assumption will eventually fail.

Sources