StakeBench: who actually pays when a web agent gets injected?
A stakeholder-centric benchmark from NTU, IBM Research and UIUC shows web agents fail every injection objective tested — and that the harm often lands on third parties, not the user.
What is this?
StakeBench is a prompt-injection benchmark for real-world web agents, introduced in a paper submitted to arXiv on June 11, 2026 (arXiv:2606.13385) by researchers from Nanyang Technological University, ST Engineering, IBM Research and the University of Illinois Urbana-Champaign. Its core argument: existing benchmarks are attack-centric — they measure whether an injection technically succeeds — while in real deployments the question that matters is victim-dependent: who bears the harm when an agent is manipulated. The same exploit can hurt the user, a third-party seller, or the platform, with very different severity and visibility.
How it works
StakeBench grounds its evaluation in online shopping, instantiated on OneStopMarket from VisualWebArena — a functional e-commerce environment where untrusted content (reviews, ratings, product metadata) flows straight into the agent’s context. The benchmark organizes 12 attack objectives by the stakeholder bearing the harm (User, Seller, Platform), realized through 22 reusable templates (9 direct, 13 indirect injection) and instantiated across 12 product categories, yielding 264 executable adversarial cases.
Each run is scored on three axes: Attack Success Rate (ASR), Task Deviation Rate (TDR — did the user’s delegated task get disrupted?) and Behavioral Irregularity Rate (BIR — did execution destabilize?). ASR and TDR jointly define four failure regimes:
| Regime | ASR | TDR | Meaning |
|---|---|---|---|
| Robust Behavior | low | low | attack fails, task completes |
| Stealthy Parasitism | high | low | attack succeeds, user sees nothing wrong |
| Misaligned Disruption | low | high | attack fails but wrecks the task |
| Compounded Failure | high | high | both objective and task integrity violated |
The authors evaluated two production-style agent systems — NanoBrowser (multi-agent browser extension, separate planning and navigation modules) and BrowserUse (single-agent iterative control loop) — each paired with GPT-5 and Gemini-2.5-Flash as backbones.
Why it matters
The headline numbers are bad across the board: indirect prompt injection achieved an ASR between 41.67% and 68.16% in every configuration tested, and not a single attack objective was reliably resisted. But the stakeholder lens is what makes this paper useful. Some attacks succeed without disrupting the user’s task at all — harming third-party sellers behind the appearance of perfectly normal agent behavior (stealthy parasitism). Conventional, user-centric evaluation literally cannot see this failure mode: the task completed, the user is happy, and someone else paid the price.
Two more findings deserve attention. First, backbone choice dominates architecture: switching from GPT-5 to Gemini-2.5-Flash raised IPI ASR by 26.49 points on NanoBrowser and 6.2 points on BrowserUse, with BrowserUse-Gemini hitting the worst TDR (45.09%) and BIR (28.85%) of all configurations. Second, a preliminary experiment with visual manipulation of product images suggests the injection surface extends beyond text — rating signals alone did not neutralize visual influence.
Defenses
The paper characterizes vulnerability and leaves defense evaluation to future work, but its findings translate directly into practice. Model your threat surface per stakeholder: ask not just “can my agent be injected?” but “who is harmed if it is?” — user-facing task success is not evidence of security. Treat reviews, ratings and product metadata as untrusted input channels into the agent’s context, and apply provenance separation or sanitization before they reach the model. Benchmark backbone swaps before shipping them: StakeBench shows the model choice can move ASR by over 26 points on an identical architecture. Monitor process-level signals (tool-call irregularity, navigation instability — the BIR analogue) rather than only task outcomes, since stealthy parasitism leaves outcomes intact. And for marketplace operators: agent-mediated purchases shift fraud incentives toward content-level manipulation of agents, which deserves its own abuse-detection pipeline.
Status
| Item | Detail |
|---|---|
| Paper | arXiv:2606.13385, submitted June 11, 2026 |
| Benchmark | 264 cases, 22 templates, 12 objectives — public on GitHub (StakeBench/SBC) |
| Systems tested | NanoBrowser and BrowserUse, with GPT-5 and Gemini-2.5-Flash |
| Worst-case IPI ASR | 68.16% (range 41.67–68.16% across configurations) |
| Patch status | Not a single-vendor flaw — a measurement of a systemic weakness in web agents |