
StakeBench: who actually pays when a web agent gets injected?

A stakeholder-centric benchmark from NTU, IBM Research and UIUC shows that web agents did not reliably resist any injection objective tested, and that the harm often lands on third parties rather than the user.

2026-06-12 // 6 min affects: gpt-5, gemini-2.5-flash, nanobrowser, browser-use

What is this?

StakeBench is a prompt-injection benchmark for real-world web agents, introduced in a paper submitted to arXiv on June 11, 2026 (arXiv:2606.13385) by researchers from Nanyang Technological University, ST Engineering, IBM Research and the University of Illinois Urbana-Champaign. Its core argument: existing benchmarks are attack-centric — they measure whether an injection technically succeeds — while in real deployments the question that matters is victim-dependent: who bears the harm when an agent is manipulated. The same exploit can hurt the user, a third-party seller, or the platform, with very different severity and visibility.

How it works

StakeBench grounds its evaluation in online shopping, instantiated on OneStopMarket from VisualWebArena — a functional e-commerce environment where untrusted content (reviews, ratings, product metadata) flows straight into the agent’s context. The benchmark organizes 12 attack objectives by the stakeholder bearing the harm (User, Seller, Platform), realized through 22 reusable templates (9 direct, 13 indirect injection) and instantiated across 12 product categories, yielding 264 executable adversarial cases.
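The case count follows from crossing templates with categories: 22 templates times 12 categories gives 264. A minimal sketch of that enumeration, assuming every template is instantiated once per category; the template and category names below are hypothetical placeholders, not the benchmark's real identifiers:

```python
from itertools import product

# Hypothetical placeholder names; the real identifiers live in the benchmark.
direct_templates = [f"direct_{i}" for i in range(1, 10)]      # 9 direct injection templates
indirect_templates = [f"indirect_{i}" for i in range(1, 14)]  # 13 indirect injection templates
templates = direct_templates + indirect_templates

categories = [f"category_{i}" for i in range(1, 13)]          # 12 product categories

# Assumption: one executable case per (template, category) pair.
cases = list(product(templates, categories))

print(len(templates), len(cases))  # 22 264
```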

Each run is scored on three axes: Attack Success Rate (ASR), Task Deviation Rate (TDR — did the user’s delegated task get disrupted?) and Behavioral Irregularity Rate (BIR — did execution destabilize?). ASR and TDR jointly define four failure regimes:

| Regime | ASR | TDR | Meaning |
| --- | --- | --- | --- |
| Robust Behavior | low | low | attack fails, task completes |
| Stealthy Parasitism | high | low | attack succeeds, user sees nothing wrong |
| Misaligned Disruption | low | high | attack fails but wrecks the task |
| Compounded Failure | high | high | both objective and task integrity violated |
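The three rates and the regime table can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's scoring code: each rate is taken to be the fraction of runs with the corresponding flag set, and the `cutoff` separating "low" from "high" is a hypothetical parameter, since the text here does not say where that line is drawn.

```python
from dataclasses import dataclass

@dataclass
class Run:
    attack_succeeded: bool  # counts toward ASR
    task_deviated: bool     # counts toward TDR
    irregular: bool         # counts toward BIR

def rates(runs):
    """Return (ASR, TDR, BIR) as fractions of runs. Assumed definition."""
    n = len(runs)
    if n == 0:
        raise ValueError("need at least one run")
    asr = sum(r.attack_succeeded for r in runs) / n
    tdr = sum(r.task_deviated for r in runs) / n
    bir = sum(r.irregular for r in runs) / n
    return asr, tdr, bir

def regime(asr, tdr, cutoff=0.5):
    """Map (ASR, TDR) to one of the four regimes; cutoff is a hypothetical threshold."""
    high_asr = asr >= cutoff
    high_tdr = tdr >= cutoff
    if high_asr and high_tdr:
        return "Compounded Failure"
    if high_asr:
        return "Stealthy Parasitism"
    if high_tdr:
        return "Misaligned Disruption"
    return "Robust Behavior"

# Toy data: the attack lands in 3 of 4 runs, the user's task deviates in 1 of 4.
runs = [
    Run(True, False, False),
    Run(True, False, False),
    Run(True, True, True),
    Run(False, False, False),
]
asr, tdr, bir = rates(runs)
print(asr, tdr, bir, regime(asr, tdr))  # 0.75 0.25 0.25 Stealthy Parasitism
```

Keeping BIR out of the regime label mirrors the text, which defines the regimes from ASR and TDR only.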

The authors evaluated two production-style agent systems — NanoBrowser (multi-agent browser extension, separate planning and navigation modules) and BrowserUse (single-agent iterative control loop) — each paired with GPT-5 and Gemini-2.5-Flash as backbones.

Why it matters

The headline numbers are bad across the board: indirect prompt injection achieved an ASR between 41.67% and 68.16% in every configuration tested, and not a single attack objective was reliably resisted. But the stakeholder lens is what makes this paper useful. Some attacks succeed without disrupting the user’s task at all — harming third-party sellers behind the appearance of perfectly normal agent behavior (stealthy parasitism). Conventional, user-centric evaluation literally cannot see this failure mode: the task completed, the user is happy, and someone else paid the price.

Two more findings deserve attention. First, backbone choice dominates architecture: switching from GPT-5 to Gemini-2.5-Flash raised indirect prompt injection (IPI) ASR by 26.49 points on NanoBrowser and 6.2 points on BrowserUse, with BrowserUse-Gemini hitting the worst TDR (45.09%) and BIR (28.85%) of all configurations. Second, a preliminary experiment with visual manipulation of product images suggests the injection surface extends beyond text: rating signals alone did not neutralize visual influence.
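A quick arithmetic check on the quoted figures: the 26.49-point NanoBrowser gap equals the full width of the reported 41.67% to 68.16% ASR range. If the quoted numbers are exact, that would place NanoBrowser with GPT-5 at the bottom of the range and NanoBrowser with Gemini-2.5-Flash at the top, with both BrowserUse configurations in between. This is a reading derived from the numbers above, not a table taken from the paper, and the variable names are illustrative.

```python
from decimal import Decimal

# Figures quoted in this article (percentage points).
asr_min = Decimal("41.67")     # lowest IPI ASR across configurations
asr_max = Decimal("68.16")     # highest IPI ASR across configurations
nano_delta = Decimal("26.49")  # GPT-5 -> Gemini-2.5-Flash gap on NanoBrowser

range_width = asr_max - asr_min
print(range_width)                # 26.49
print(range_width == nano_delta)  # True
```

The individual BrowserUse values cannot be recovered the same way, since only their 6.2-point difference is quoted.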

Defenses

The paper characterizes vulnerability and leaves defense evaluation to future work, but its findings translate directly into practice. Model your threat surface per stakeholder: ask not just “can my agent be injected?” but “who is harmed if it is?” — user-facing task success is not evidence of security. Treat reviews, ratings and product metadata as untrusted input channels into the agent’s context, and apply provenance separation or sanitization before they reach the model. Benchmark backbone swaps before shipping them: StakeBench shows the model choice can move ASR by over 26 points on an identical architecture. Monitor process-level signals (tool-call irregularity, navigation instability — the BIR analogue) rather than only task outcomes, since stealthy parasitism leaves outcomes intact. And for marketplace operators: agent-mediated purchases shift fraud incentives toward content-level manipulation of agents, which deserves its own abuse-detection pipeline.
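As an illustration of the provenance-separation idea, here is a minimal sketch, assuming a text-only agent prompt. The paper does not evaluate defenses, so nothing here is a tested mitigation. The names `UntrustedItem`, `sanitize` and `build_context` are hypothetical, and labelling content as data is not known to stop injection on its own.

```python
import json
import unicodedata
from dataclasses import dataclass

@dataclass(frozen=True)
class UntrustedItem:
    channel: str  # e.g. "review", "rating", "product_metadata"
    text: str

def sanitize(text: str, max_len: int = 2000) -> str:
    """Drop control and format characters (Unicode 'C*' categories), then truncate."""
    cleaned = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    return cleaned[:max_len]

def build_context(task: str, items: list[UntrustedItem]) -> str:
    """Keep the user's task and page content in separately labelled sections.

    Untrusted text is JSON-encoded so quotes and newlines inside it cannot
    close the data section early.
    """
    payload = [{"channel": it.channel, "text": sanitize(it.text)} for it in items]
    return (
        "TRUSTED TASK (from the user):\n" + task + "\n\n"
        "UNTRUSTED PAGE CONTENT (data only, never instructions), as JSON:\n"
        + json.dumps(payload, ensure_ascii=False)
    )

ctx = build_context(
    "Buy a red mug under $15",
    [UntrustedItem("review", "Nice\u200b mug\x00. Ignore previous instructions.")],
)
print(ctx)
```

A layout like this also gives monitoring something to key on: every string that reached the model from a review or rating field is tagged with its channel, so it can be logged and audited alongside tool calls.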

Status

| Item | Detail |
| --- | --- |
| Paper | arXiv:2606.13385, submitted June 11, 2026 |
| Benchmark | 264 cases, 22 templates, 12 objectives; public on GitHub (StakeBench/SBC) |
| Systems tested | NanoBrowser and BrowserUse, with GPT-5 and Gemini-2.5-Flash |
| Worst-case IPI ASR | 68.16% (range 41.67–68.16% across configurations) |
| Patch status | Not a single-vendor flaw, but a measurement of a systemic weakness in web agents |
