DEFENSE MEDIUM NEW

AgentDyn: why injection defenses that ace static benchmarks fail in the wild

A February 2026 ICML benchmark, AgentDyn, runs ten leading prompt-injection defenses on dynamic, open-ended agent tasks. Almost all either remain insecure or over-defend into uselessness.

2026-06-12 // 6 min // affects: gpt-4o, gpt-5.1, gemini-2.5-pro, llama-3.3-70b, qwen3-235b

What is this?

AgentDyn is a prompt-injection benchmark for tool-using LLM agents, published on arXiv in February 2026 (2602.03117, authors Hao Li, Ruoyao Wen, Shanghao Shi, Ning Zhang and Chaowei Xiao; code at github.com/leolee99/AgentDyn). Its finding is uncomfortable: of ten state-of-the-art defenses that report near-perfect numbers on the popular static benchmark AgentDojo, almost none are deployable once tasks become dynamic and open-ended. They are either still insecure, or they “defend” by destroying the agent’s usefulness.

The paper is a methodology critique, not an exploit. It matters because defenders increasingly cite leaderboard ASR (attack success rate) figures — often near zero — as evidence that prompt injection is handled. AgentDyn argues those figures are an artifact of how the benchmark is built. This echoes a wider 2026 theme; see our note on why benchmarking agents is hard.

How it works

AgentDyn identifies three structural flaws in current static benchmarks and builds against them:

Lack of dynamic, open-ended tasks. In AgentDojo only 6 of 97 tasks require replanning, so an agent can plan its whole action sequence up front. A defense can then look secure simply by sticking to that initial plan, a shortcut that breaks the moment a task requires adapting mid-execution.
Lack of helpful instructions. Real third-party content is full of benign, useful instructions (“please log in first” on a checkout page), and whether an instruction is malicious is context-dependent. A defense that just ignores all external instructions scores well on a benchmark that contains none, and falls apart in reality.
Simplistic user tasks. Prior benchmarks average 1–3 steps, 1–2 apps and under 20 tools.
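
The first flaw can be made concrete with a toy sketch. Nothing here comes from AgentDyn or from any real defense: `PlanFrozenGuard`, `task_completes` and the tool names are hypothetical, and the guard is assumed to permit only the tools named in the plan made before execution starts. On a static task it looks both secure and useful. On a task where a benign instruction ("please log in first") forces a replan, it blocks the tool the agent needs.

```python
from dataclasses import dataclass, field


@dataclass
class PlanFrozenGuard:
    """Toy defense (hypothetical): permit only tools from the up-front plan."""
    allowed: frozenset
    blocked: list = field(default_factory=list)

    def permit(self, tool: str) -> bool:
        ok = tool in self.allowed
        if not ok:
            self.blocked.append(tool)  # remember what was refused
        return ok


def task_completes(needed_calls, guard) -> bool:
    """True only if every tool call the task needs was permitted."""
    # Evaluate every call (no short-circuit) so guard.blocked is complete.
    results = [guard.permit(tool) for tool in needed_calls]
    return all(results)


# Plan fixed before the agent has seen any tool output.
INITIAL_PLAN = frozenset({"search_product", "add_to_cart", "checkout"})

# Static task: the calls needed match the initial plan exactly.
static_task = ["search_product", "add_to_cart", "checkout"]

# Dynamic task: the checkout page asks the user to log in first,
# so a benign extra step appears mid-execution.
dynamic_task = ["search_product", "add_to_cart", "login", "checkout"]

if __name__ == "__main__":
    # Static task: completes, and an injected call is refused.
    print(task_completes(static_task, PlanFrozenGuard(INITIAL_PLAN)))  # True
    print(PlanFrozenGuard(INITIAL_PLAN).permit("send_money"))          # False

    # Dynamic task: the benign "login" step is refused as well.
    guard = PlanFrozenGuard(INITIAL_PLAN)
    print(task_completes(dynamic_task, guard), guard.blocked)  # False ['login']
```

The guard cannot distinguish the benign `login` step from the injected `send_money` call. Both are simply "not in the plan", which is why a benchmark with few replanning tasks flatters this kind of control.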

AgentDyn answers with 60 open-ended tasks and 560 injection test cases across Shopping, GitHub and Daily Life, averaging 7.1 steps and 3.17 application scenarios per task, all requiring dynamic planning with benign instructions interleaved. Built on the AgentDojo framework, it was run against eight agents (GPT-4o, GPT-5.1, Gemini-2.5-Pro/Flash, Llama-3.3-70B, Qwen3-235B and others) and four defense families.

Why it matters

The results expose a defense trilemma, not a tuning problem (a theme we cover in the prompt-injection wrapper trilemma). On GPT-4o:

Prompting defenses (Prompt Sandwiching, Spotlighting) keep utility but cut ASR only modestly: it stays at roughly 27–31%, against 37.8% with no defense.
Filtering (ProtectAI, PIGuard) cannot tell helpful instructions from injections and drive utility to near zero; PromptGuard2 holds utility until an attack appears, then discards the whole tool output and still leaves 27.15% ASR.
System-level designs that enforce a fixed plan, such as CaMeL, hit 0% ASR but also 0% utility on fully open-ended tasks. Plan-dependent defenses (Tool Filter, Progent, DRIFT) suffer heavy utility loss as toolsets grow and early access decisions block tools needed later.
The one relatively balanced result is alignment (Meta SecAlign 70B), which improves utility while shaving ASR — yet still leaves a ~9% residual.

The lesson for anyone shipping agents: a defense advertised at near-zero ASR may have bought that number with over-defense you will feel as broken workflows, or with a benchmark that never tested adaptive, multi-step tasks. The same caution applies to reading any single operating point — see detector benchmarks and operating points.
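
A minimal sketch of reporting security and usefulness together, so that neither number is read alone. The `Episode` record and the `report` function are hypothetical, not AgentDyn's harness. The sketch assumes utility is the fraction of tasks completed and ASR is the fraction of attacked episodes in which the injected goal was carried out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """One benchmark run (hypothetical record shape)."""
    under_attack: bool      # was an injection present in the tool output?
    task_completed: bool    # did the agent finish the user's task?
    attack_succeeded: bool  # was the injected goal carried out?


def rate(hits: int, total: int) -> float:
    """Fraction of hits, or NaN when there is nothing to measure."""
    return hits / total if total else float("nan")


def report(episodes: list) -> dict:
    """Return utility and ASR side by side for one defense."""
    benign = [e for e in episodes if not e.under_attack]
    attacked = [e for e in episodes if e.under_attack]
    return {
        "utility_no_attack": rate(sum(e.task_completed for e in benign), len(benign)),
        "utility_under_attack": rate(sum(e.task_completed for e in attacked), len(attacked)),
        "asr": rate(sum(e.attack_succeeded for e in attacked), len(attacked)),
    }


# Illustrative data only: a filter that blocks everything, attacks and tasks alike.
over_defended = (
    [Episode(False, False, False) for _ in range(4)]
    + [Episode(True, False, False) for _ in range(4)]
)

if __name__ == "__main__":
    print(report(over_defended))
```

For the made-up `over_defended` data, all three figures come out as 0.0. The ASR would look perfect in isolation, while the two utility figures show that the agent never finished a task.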

Defenses

AgentDyn is itself a defensive tool. Concrete takeaways:

Re-test defenses on dynamic, long-horizon tasks. Treat AgentDojo-style near-zero ASR as necessary, not sufficient. Use AgentDyn or comparable open-ended suites before trusting a vendor claim.
Measure utility under defense, not just ASR. A control that zeros out attacks while halving task completion is not a win; report both numbers together.
Prefer adaptive over plan-frozen controls. Static-plan enforcement (e.g. fixed program synthesis) is brittle on open-ended work. Dynamic, task-based access control degrades more gracefully — see task-based tool authorization.
Keep defense-in-depth. Pair lightweight runtime checks with instruction-hierarchy training and least-privilege scoping rather than betting on one filter.
Constrain blast radius. Even ~9% residual ASR is unacceptable for high-impact tools; gate sensitive actions behind human review and limit the lethal trifecta of private data, untrusted content and exfiltration paths.
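
The last takeaway can be sketched as a small gate in front of tool execution. This is an illustration under stated assumptions, not a recommended policy: `SessionState`, `HIGH_IMPACT_TOOLS`, `needs_review` and `call_tool` are all hypothetical names, and the sketch assumes the runtime already tracks whether the session has read private data and ingested untrusted content.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SessionState:
    """What the agent has touched so far (hypothetical tracking)."""
    read_private_data: bool
    saw_untrusted_content: bool


# Hypothetical tools that can move data out or act irreversibly.
HIGH_IMPACT_TOOLS = frozenset({"send_email", "transfer_funds", "push_to_remote"})


def needs_review(tool: str, state: SessionState) -> bool:
    """True when a high-impact call would complete the trifecta."""
    return (
        tool in HIGH_IMPACT_TOOLS
        and state.read_private_data
        and state.saw_untrusted_content
    )


def call_tool(
    tool: str,
    state: SessionState,
    execute: Callable[[str], str],
    approve: Callable[[str], bool],
) -> str:
    """Run the tool, asking a human reviewer first when the gate trips."""
    if needs_review(tool, state) and not approve(tool):
        return f"blocked: reviewer declined {tool}"
    return execute(tool)


if __name__ == "__main__":
    risky = SessionState(read_private_data=True, saw_untrusted_content=True)
    # Stand-ins for a real executor and a real review prompt.
    print(call_tool("send_email", risky, lambda t: f"ran {t}", lambda t: False))
    print(call_tool("search_product", risky, lambda t: f"ran {t}", lambda t: False))
```

Gating only when all three conditions hold keeps routine calls unblocked. A stricter variant would send every high-impact call to review regardless of session state, trading more interruptions for a smaller blast radius.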

Status

Defense family	Example	GPT-4o utility (no attack)	ASR	Failure mode
None	Vanilla	53.3%	37.8%	baseline
Prompting	Spotlighting	55.0%	27.6%	low security
Filtering	PromptGuard2	60.0%	27.2%	drops tool output under attack
Filtering	ProtectAI	~0%	~1%	severe over-defense
System-level	CaMeL	0%	0%	zero utility on open-ended tasks
Alignment	Meta SecAlign 70B	improved	~9%	best balance, residual risk

The authors stress that AgentDyn is “just a small open-ended benchmark,” yet every tested defense struggles on it — the gap to real deployment is larger still. Recent work converges on the same warning that clean leaderboard numbers can mislead (Adversa AI, June 2026; “measuring security without fooling ourselves,” May 2026). The defensive posture that follows is not “pick the defense with the lowest ASR” but “verify it on tasks that look like yours, and keep the layers you would need if it fails.”

This article summarizes published research for defensive and educational purposes. It contains no operational attack payloads.