DEFENSE MEDIUM NEW

WARD: a co-evolved guard model that holds up against adaptive prompt injection on web agents

A May 14, 2026 NUS paper proposes WARD — a guard model trained against a memory-driven adversarial attacker — and reports near-perfect out-of-distribution recall on web-agent prompt injection.

2026-05-29 // 7 min read // affects: web-agents, browser-use, llama-guard-4, prompt-guard-1, gpt-oss-safeguard

What is this?

On May 14, 2026, Tri Cao, Yulin Chen, Hieu Cao, Yibo Li, Khoi Le, Thong Nguyen, Yuexin Li, Yufei He, Yue Liu, Shuicheng Yan and Bryan Hooi (National University of Singapore, with co-authors at University of Science and Vietnam National University, Ho Chi Minh City) posted WARD: Adversarially Robust Defense of Web Agents Against Prompt Injections on arXiv (2605.15030). The paper releases code and models under CC BY 4.0 at github.com/caothientri2001vn/WARD-WebAgent.

WARD is a guard model for web agents — a side-car classifier that runs in parallel with a browser agent and flags prompt-injection content embedded in HTML or screenshots before the agent acts on it. The contribution is twofold: a large-scale, carefully labelled dataset (WARD-Base, 177,585 samples over 709 URLs and 10 platforms), and a two-loop adversarial training procedure (A3T) that hardens the guard against attacks crafted specifically to bypass it.

The headline result from the abstract is “nearly perfect recall on out-of-distribution benchmarks, low false positive rates to preserve agent utility, [and] robust[ness] against guard-targeted and adaptive attacks under substantial distribution shifts” while “running efficiently in parallel with the agent without introducing additional latency.”

How it works

The dataset and the training loop are the interesting pieces.

The underlying pool is the 50 most-visited URLs in each of 21 Similarweb categories (808 URLs after filtering) plus 20 simulated platforms across collaborative systems, email, messaging, e-commerce and social media. WARD-Base itself covers 709 of those URLs and 10 of the platforms, which is consistent with the WARD-Seed and WARD-Test splits described below being carved out of the same pool (709 + 49 + 50 = 808 URLs; 10 + 4 + 6 = 20 platforms). Five benign user tasks are generated per URL, 4,040 in total, and executed using a Browser-Use agent to record realistic page states. The authors then enumerate six attack goal categories (user information exfiltration, unauthorized action execution, policy-violating content generation, single-step UI manipulation, agent memory manipulation, and utility degradation) and four injection-location groups (HTML, screenshot, both, none). Injection channels include overlay channels (footer text, alert box, badge, banner, notification, inset chat, popup) and native channels (chat message, email body, post and comment, README, product description). The resulting set contains 90,802 benign and 86,783 malicious samples, 177,585 in total.
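The taxonomy above maps naturally onto a small record type. The following is a minimal sketch, not the released schema: the class and field names (`WardSample`, `goal`, `location`, `channel`) are illustrative assumptions, as is the convention that a location of `NONE` marks a benign sample. The closing assertions only re-check the counts quoted in this article.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AttackGoal(Enum):
    # The six attack-goal categories listed in the article.
    EXFILTRATION = "user information exfiltration"
    UNAUTHORIZED_ACTION = "unauthorized action execution"
    POLICY_VIOLATION = "policy-violating content generation"
    UI_MANIPULATION = "single-step UI manipulation"
    MEMORY_MANIPULATION = "agent memory manipulation"
    UTILITY_DEGRADATION = "utility degradation"


class InjectionLocation(Enum):
    # The four injection-location groups.
    HTML = "html"
    SCREENSHOT = "screenshot"
    BOTH = "both"
    NONE = "none"


@dataclass(frozen=True)
class WardSample:
    # Hypothetical record layout; the released files may differ.
    url: str
    user_task: str
    location: InjectionLocation
    goal: Optional[AttackGoal] = None   # None for benign samples
    channel: Optional[str] = None       # e.g. "popup", "email body"

    @property
    def is_malicious(self) -> bool:
        # Assumption: "none" as a location means no injection is present.
        return self.location is not InjectionLocation.NONE


# Sanity checks on the counts quoted in the article.
TASKS_PER_URL = 5
assert 808 * TASKS_PER_URL == 4_040
assert 90_802 + 86_783 == 177_585
```

Keeping goal, location and channel as separate fields lets an evaluation be sliced per channel or per location rather than reported as a single aggregate.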

# Conceptual sketch of the WARD pipeline, from the paper.
# No exploit payload against any live system is reproduced.

[ web page or screenshot ]                 # observation modalities
              │
              ▼
[ benign user task ]
              │
              ▼
[ injection at location ℓ ∈ {HTML, Screenshot, Both, None} ]
              │
              ▼
[ WARD guard G_θ ] ──► (predicted label ŷ, predicted location ℓ̂)
              │                       │
              ▼                       ▼
   pass to agent             refuse + log to telemetry

The training pipeline has three stages. Stage 1 is supervised training on WARD-Base. Stage 2 introduces WARD-PIG, a dedicated set of 10,500 malicious and 10,500 benign samples in which guard-targeted prompts (Prompt Injection against Guard, “PIG”) are inserted into the same observation as the agent-targeted prompt — so the same content tries to fool both the agent and the guard at once. Stage 3 is A3T, an adaptive adversarial attack training loop:

Inner loop. An attacker model proposes adversarial prompts p_adv that simultaneously satisfy the attack goal and bypass the current guard G_θ. A validator filters candidates on goal consistency, contextual plausibility and validity as an injection. The attacker carries two levels of memory — sample-level (previous attempts, validator feedback, guard predictions, failure reasons) and platform-level (reusable strategies for the same URL or platform) — so successful patterns transfer and failed ones get refined.
Outer loop. The guard is fine-tuned with GRPO on the harvested adversarial samples, with a reward of 1.0 for correct label and correct injection location, 0.5 for correct label but wrong location, and 0.0 otherwise. The starting checkpoint is the WARD-PIG-trained model.
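
The outer-loop reward can be written down directly from the description above. The sketch below is illustrative: the function names and string labels are assumptions, and `group_advantages` shows the usual GRPO-style group normalisation (reward minus group mean, divided by group standard deviation), which the article does not confirm as the exact variant the authors use.

```python
from statistics import mean, pstdev


def guard_reward(pred_label, pred_loc, true_label, true_loc):
    """Reward as described in the article: 1.0 / 0.5 / 0.0."""
    if pred_label != true_label:
        return 0.0                      # wrong label: no credit
    return 1.0 if pred_loc == true_loc else 0.5


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages over one group of sampled guard outputs.

    Assumed formulation: (r - mean) / (std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled guard outputs for one malicious sample injected in the HTML.
rewards = [
    guard_reward("malicious", "html", "malicious", "html"),        # 1.0
    guard_reward("malicious", "screenshot", "malicious", "html"),  # 0.5
    guard_reward("benign", "none", "malicious", "html"),           # 0.0
]
advantages = group_advantages(rewards)
```

The half-credit tier means an output that detects the attack but misplaces it is still preferred over a miss, while the fully correct output is preferred over both.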

The two loops co-evolve: the attacker probes near the current decision boundary, the guard updates, the attacker probes again. The authors run this on WARD-Seed (49 URLs and 4 platforms held out from WARD-Base) and evaluate on WARD-Test (50 URLs and 6 platforms) plus four external benchmarks — Popup, EIA, VPI and WASP — none of which were used to train WARD.

The baseline comparison is unflattering for current production guards. Llama-Guard-4-12B records 23.2% recall and an F1 of 37.46 on WARD-Test, and on the Popup benchmark its recall collapses to 0.26%. GPT-Oss-Safeguard-20B reaches 40.67% recall on WARD-Test with an F1 of 57.82, but records 0.00% recall on Popup, EIA-like and WASP-like injections. Prompt-Guard-1-86M is similarly weak. These numbers explain why a web-agent-specific guard is needed: generic LLM safety guards were not built for HTML and screenshot injection surfaces and they do not generalise to them.
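
One way to read the WARD-Test figures is to back out the precision they imply. The sketch below rearranges F1 = 2PR / (P + R) to solve for P; the function name is illustrative. The resulting values are derived here, not reported in this article, and they are sensitive to rounding in the published recall and F1.

```python
def implied_precision(recall: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, with recall and F1 as fractions."""
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("F1 cannot reach or exceed 2 * recall")
    return f1 * recall / denom


# (recall, F1) on WARD-Test as quoted above, converted to fractions.
baselines = {
    "Llama-Guard-4-12B": (0.2320, 0.3746),
    "GPT-Oss-Safeguard-20B": (0.4067, 0.5782),
}

for name, (recall, f1) in baselines.items():
    p = implied_precision(recall, f1)
    print(f"{name}: implied precision ~ {p:.2f}")
```

Both baselines come out with implied precision close to 1, which suggests their weakness on this test set is missed attacks rather than false alarms.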

Why it matters

Three reasons WARD is worth tracking, even if you do not deploy it as-is.

First, the threat model is the one most production teams are now shipping into. Browser-based agents — OpenAI Operator, Anthropic Computer Use, ChatGPT Agent, Browser-Use, Skyvern and a long list of open-source projects — operate on attacker-controlled HTML and pixels. As earlier work documented (the HTML accessibility tree IPI paper, arXiv:2507.14799, and Manipulating LLM Web Agents), an attacker who controls a page can hijack an agent that reads it. WARD takes that threat model as a first-class object instead of trying to retrofit a general-purpose safety classifier.

Second, the datasets are a contribution in their own right. Most public prompt-injection benchmarks (AdvBench, AgentDojo, InjecAgent) target chat or tool-call surfaces and contain at most a few thousand examples. A 177K-sample web-specific corpus with explicit channel and location labels, released under CC BY 4.0, is something the defensive community did not have. Even teams that ignore WARD-the-model can train and evaluate their own guards on WARD-Base.

Third, the co-evolution training pattern is portable. A3T’s inner-attacker / outer-guard structure can be applied to other guard surfaces (chat-template guards, tool-result classifiers, MCP server filters) provided you can write a validator that checks attack-goal satisfaction. It generalises a pattern that earlier work in adversarial robustness (PGD-trained classifiers in vision, FGSM-style training for text models) only partially addressed for LLM-era surfaces.

Defenses

WARD is itself a defense paper, so the “defenses” here are the operational lessons for teams shipping web agents — whether or not they adopt WARD.

Run a side-car guard, not just an instruction-hierarchy prompt. The WARD numbers are a strong reminder that asking the planner LLM to “ignore injected instructions” is not a defense. A separate model that sees the same observation and votes on it, in parallel with the agent, costs little extra latency and catches a category of failure the planner does not. WARD is designed to run alongside the agent at inference time without becoming the bottleneck.
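
The parallel arrangement can be sketched with `asyncio`. Everything here is a stand-in: `plan_next_action` and `guard_verdict` are hypothetical placeholders for the planner and guard model calls, and the keyword check only simulates a verdict; it is not how WARD classifies content.

```python
import asyncio


async def plan_next_action(observation: str) -> str:
    # Stand-in for the planner LLM call (hypothetical).
    await asyncio.sleep(0.05)
    return "click:#checkout"


async def guard_verdict(observation: str) -> dict:
    # Stand-in for the guard model call (hypothetical keyword check).
    await asyncio.sleep(0.02)
    flagged = "ignore previous instructions" in observation.lower()
    return {"malicious": flagged, "location": "html" if flagged else "none"}


async def guarded_step(observation: str) -> str:
    # Start both calls together; the guard only adds wall-clock time
    # if it is slower than the planner.
    action, verdict = await asyncio.gather(
        plan_next_action(observation), guard_verdict(observation)
    )
    if verdict["malicious"]:
        # The planner's proposed action is discarded, never executed.
        return f"refuse (injection flagged in {verdict['location']})"
    return action


if __name__ == "__main__":
    print(asyncio.run(guarded_step("<p>Welcome back</p>")))
```

The key design point is that the action is released only after the guard has answered, so running in parallel saves latency without letting an unvetted action through.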

Train on web-specific injection channels. Generic prompt-injection benchmarks (chat-style “ignore previous instructions” attacks) do not transfer to HTML and screenshot surfaces. If your agent reads pages, your eval has to include overlay channels (popups, banners, notifications) and native channels (messages, comments, README, product descriptions). WARD-Base is a credible starting corpus.

Use the four-class injection-location label. Treating injection as a binary classifier loses information. WARD predicts both label and location (HTML / screenshot / both / none); this lets a downstream policy decide differently when the attack is text-only vs visual, and gives your telemetry the granularity to identify which channel an adversary is exercising.
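
A downstream policy keyed on the predicted location might look like the sketch below. The policy table is entirely hypothetical (the paper as summarised here does not prescribe one), and the action names and telemetry tag format are assumptions for illustration.

```python
from enum import Enum


class Location(Enum):
    HTML = "html"
    SCREENSHOT = "screenshot"
    BOTH = "both"
    NONE = "none"


# Hypothetical policy: which observation modality to drop per predicted location.
POLICY = {
    Location.NONE: {"action": "allow", "drop": ()},
    Location.HTML: {"action": "retry", "drop": ("html",)},
    Location.SCREENSHOT: {"action": "retry", "drop": ("screenshot",)},
    Location.BOTH: {"action": "block", "drop": ("html", "screenshot")},
}


def decide(location: Location) -> dict:
    """Map a predicted injection location to a handling decision."""
    decision = dict(POLICY[location])  # copy so callers cannot mutate POLICY
    decision["telemetry_tag"] = f"injection.{location.value}"
    return decision
```

For example, `decide(Location.HTML)` returns a retry decision that drops the HTML input and tags the event `injection.html`, so dashboards can count attacks per channel instead of per page.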

Stress-test the guard itself. WARD-PIG is the part most existing deployments are missing. If your guard is a fixed model checkpoint or a fixed system prompt, an attacker can iterate against it offline until they find content that bypasses it. The defensive response is to incorporate guard-targeted attacks into the training set, then re-evaluate.

Adopt adversarial co-evolution where you can. Most teams cannot afford to maintain an A3T-style loop in production, but the paper’s structure is reproducible at smaller scale. Even one or two rounds of “generate adversarial samples → fine-tune the guard → re-test” measurably hardens a deployed filter, and the platform-level memory pattern (storing what worked per-URL) is straightforward to implement on top of an existing red-team pipeline.
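
A reduced version of that loop, with platform-level memory, can be sketched as follows. This is not A3T: there is no attacker model, validator or GRPO update here. `guard_blocks` and `fine_tune` are hypothetical caller-supplied callables, and strategies are represented by abstract labels rather than payloads.

```python
from collections import defaultdict


class PlatformMemory:
    """Stores strategy labels that bypassed the guard, keyed by platform."""

    def __init__(self):
        self._wins = defaultdict(list)

    def record(self, platform: str, strategy: str, bypassed: bool) -> None:
        if bypassed and strategy not in self._wins[platform]:
            self._wins[platform].append(strategy)

    def suggest(self, platform: str) -> list:
        # Strategies that previously worked on this platform, oldest first.
        return list(self._wins.get(platform, []))


def harden(guard_blocks, fine_tune, candidates, memory, rounds=2):
    """Run up to `rounds` of: probe the guard, harvest bypasses, fine-tune.

    guard_blocks(strategy) -> bool and fine_tune(bypasses) -> new guard_blocks
    are hypothetical interfaces supplied by the caller.
    """
    for _ in range(rounds):
        bypasses = []
        for platform, strategy in candidates:
            bypassed = not guard_blocks(strategy)
            memory.record(platform, strategy, bypassed)
            if bypassed:
                bypasses.append(strategy)
        if not bypasses:
            break                      # nothing got through; stop early
        guard_blocks = fine_tune(bypasses)
    return guard_blocks
```

In a real pipeline `candidates` would come from a red-team generator seeded with `memory.suggest(platform)`, and `fine_tune` would retrain the guard on the harvested samples before the next round of probing.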

Do not assume Llama-Guard or similar is enough. The Llama-Guard-4-12B and GPT-Oss-Safeguard-20B numbers in the WARD paper are the most directly actionable finding: these are reasonable defaults for chat-content moderation, but on web-agent observation channels they recall less than half of malicious content and, in some benchmarks, less than 1%. If you currently rely on a generic safety classifier as your sole web-agent guard, the WARD paper is the prompt to re-test.

Status

Item	Reference	Date	Notes
arXiv submission	WARD v1, arXiv 2605.15030	2026-05-14	cs.CR / cs.AI
Authors	Cao, Chen, Cao, Li, Le, Nguyen, Li, He, Liu, Yan, Hooi	—	NUS + University of Science + VNU-HCM
Code & models	github.com/caothientri2001vn/WARD-WebAgent	2026-05-14	Public repository
License	CC BY 4.0	2026-05-14	Re-use permitted with attribution
WARD-Base	177,585 samples; 709 URLs + 10 platforms; 90,802 benign / 86,783 malicious	—	Six attack-goal categories, four injection-location groups
WARD-PIG	10,500 malicious + 10,500 benign with guard-targeted prompts	—	Trains the guard against guard-aware attacks
A3T	Inner attack-generation loop + outer GRPO guard-update loop	—	Sample- and platform-level memory
External benchmarks	Popup, EIA, VPI, WASP	—	Used for out-of-distribution evaluation
Compared baselines	Llama-Guard-4-12B; GPT-Oss-Safeguard-20B; Prompt-Guard-1-86M	2024–2025	All underperform on web-specific channels

Web-agent security is moving from “we have a planner that knows not to follow strange instructions” to “we have a guard that has been trained against attackers who know about the guard.” WARD is a credible reference design for that second mode, and the dataset alone is worth the read.