
User-mediated attacks: when the user is the injection channel

A January 2026 study of 12 commercial agents shows attackers don't need to touch the agent. They trick a benign user into forwarding poisoned content — which the instruction hierarchy then promotes to trusted user intent. Default bypass rates topped 92%.

2026-06-19 // 6 min // affects: llm-agents, web-use-agents, planning-agents

What is this?

User-mediated attacks are a class of agent compromise in which the adversary never touches the agent at all. Instead they manipulate a benign user into relaying attacker-controlled content into their own agent request. The paper that names and measures this pattern — “Too Helpful to Be Safe: User-Mediated Attacks on Planning and Web-Use Agents” by Chen, Wu, Nguyen and Rudolph (Monash University and CSIRO’s Data61) — was posted to arXiv in January 2026 (2601.10758). It evaluated 12 commercial agents (6 trip-planning agents, 6 web-use agents) in a sandbox and found them “too helpful to be safe” by default.

The finding that matters: safety is capability-present but priority-absent. The agents could enforce safety checks, but only did so when the user explicitly asked. With no safety request, trip-planning agents bypassed safety constraints in over 92% of cases, and several web-use agents reached a 100% bypass rate on risky actions.

How it works

Most agent-security research assumes an “attacker-in-the-loop”: the adversary feeds the agent malicious input directly, or poisons a corpus the agent retrieves from. User-mediated attacks invert that. The attacker controls only what the user sees, through a four-step pipeline:

  1. Seed. The attacker publishes a persuasive, benign-looking post on a public platform (a Reddit thread, a “limited-time discount”, a how-to). It carries a payload — a disguised URL, a redirection chain, or procedural steps.
  2. Forward. A user encounters it while browsing and pastes or quotes it into their agent: “book this trip with this promo”, “follow these steps”. The poisoned content crosses from the open web into the agent’s input.
  3. Execute. The agent plans and acts on the now-biased context.
  4. Amplify. The agent’s reassuring output (“this looks official, safe to proceed”) raises the user’s confidence and drives the final harmful approval.

The mechanism is what the authors call instruction-source escalation. Modern defenses rank instructions by trust — system prompt > user input > external/model output. Indirect prompt injection is treated as low-trust external data. But when content arrives via the user’s own message, the hierarchy re-labels it as high-priority user intent. The same payload that a filter would distrust as web text is laundered into a trusted channel by the forwarding step. No payload is reproduced here; the lesson is structural, not a recipe.
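The structural point can be illustrated defensively. The sketch below is a minimal illustration, not the paper's method: it labels each line of a user message with a trust level and keeps forwarded-looking lines (quoted text, anything carrying a link) at external trust instead of promoting them to user intent. The `Trust` levels mirror the hierarchy described above; the `Segment` type, the `label_user_message` function and the quoting/URL heuristic are all assumptions made for illustration, and a real system would need a far more robust provenance signal.

```python
import re
from dataclasses import dataclass
from enum import IntEnum


class Trust(IntEnum):
    """Trust ranking from the hierarchy above: system > user > external."""
    EXTERNAL = 0
    USER = 1
    SYSTEM = 2


@dataclass(frozen=True)
class Segment:
    text: str
    trust: Trust


# Hypothetical heuristic: a line that carries a link is treated as forwarded.
URL_RE = re.compile(r"\b(?:https?://|data:|javascript:)\S+", re.IGNORECASE)


def label_user_message(message: str) -> list[Segment]:
    """Label each non-empty line; forwarded-looking lines stay EXTERNAL."""
    segments = []
    for line in message.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        # Quoted text ("> ...") or anything with a URL keeps external trust,
        # even though it arrived inside the user's own message.
        forwarded = stripped.startswith(">") or bool(URL_RE.search(stripped))
        trust = Trust.EXTERNAL if forwarded else Trust.USER
        segments.append(Segment(stripped, trust))
    return segments


if __name__ == "__main__":
    msg = (
        "Book this trip for me\n"
        "> Limited-time promo, today only\n"
        "follow the steps at http://example.test/steps"
    )
    for seg in label_user_message(msg):
        print(seg.trust.name, "|", seg.text)
```

The design choice is that the channel a string arrives through does not decide its trust level: only the user's own words are ranked as user intent, and everything they relay is planned against as data.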

The measured failure modes are concrete. URL verification was shallow and overconfident: agents asserted that cybersquatting, typosquatting and Cyrillic-homograph domains were “official” without real validation, and changing only a URL prefix bypassed checks 88% of the time even under a soft safety request. Web-use agents opened malicious links across http, https, data and javascript schemes, relying on browser blacklists rather than agent-side reasoning. They executed actions by task progress alone, scrolled past explicit malicious content, filled hidden DOM fields, and re-submitted data when a spoofed “failure” message appeared — silently exfiltrating it while user and agent both believed they were retrying.
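As a rough illustration of what agent-side URL reasoning could look like, the sketch below flags the failure modes listed above: non-https schemes, non-ASCII or punycode hostnames (a possible homograph), and hostnames that merely start with a trusted name. It uses only Python's standard `urllib.parse`. The `url_risks` function, the `trusted_domains` allowlist and the example domains are assumptions for illustration; the study does not prescribe this design, and an allowlist is only one possible notion of "validated".

```python
from urllib.parse import urlsplit

# Assumption for this sketch: only https links are acceptable to act on.
ALLOWED_SCHEMES = {"https"}


def url_risks(url: str, trusted_domains: set[str]) -> list[str]:
    """Return human-readable risk flags; an empty list means none were found."""
    try:
        parts = urlsplit(url)
        host = (parts.hostname or "").rstrip(".")
    except ValueError:
        return ["unparseable URL"]

    risks = []
    if parts.scheme not in ALLOWED_SCHEMES:
        risks.append(f"scheme '{parts.scheme}' not allowed")
    if not host:
        risks.append("no hostname")
        return risks

    # Cyrillic look-alikes show up as non-ASCII text or as xn-- labels.
    if not host.isascii() or any(
        label.startswith("xn--") for label in host.split(".")
    ):
        risks.append("non-ASCII or punycode hostname (possible homograph)")

    # Match the whole domain on a dot boundary, never a prefix:
    # example.com.evil.test must not pass as example.com.
    if not any(host == d or host.endswith("." + d) for d in trusted_domains):
        risks.append("hostname not under a trusted domain")
    return risks


if __name__ == "__main__":
    trusted = {"example.com"}  # hypothetical allowlist
    for candidate in (
        "https://www.example.com/deals",
        "https://example.com.evil.test/login",
        "https://ex\u0430mple.com/",  # Cyrillic "a" in the hostname
        "javascript:alert(1)",
    ):
        print(candidate.encode("ascii", "backslashreplace").decode(),
              url_risks(candidate, trusted))
```

An agent built this way would report the flags to the user rather than asserting that a link is "official"; an empty list means only that these particular checks passed.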

Why it matters

This closes a gap that input filtering does not cover. The dominant defensive assumption is that the attacker lacks access to the agent, so securing the agent’s inputs is enough. User-mediated attacks satisfy that assumption and still succeed, because the human is the vector. The March 2026 survey From Secure Agentic AI to Secure Agentic Web (2603.01564) frames the same shift: as agents move from a controlled tool surface to the open, human-populated web, the trust boundary stops being the API and becomes everything the user can be persuaded to relay.

It also raises the stakes of helpfulness. An agent that advises a bad booking is recoverable; a web-use agent that clicks, submits and pays produces immediate, irreversible harm. The study found agents lack a “least-action stopping rule” — they keep interacting past the user’s actual objective, treating every available control as a legitimate command. The danger is not a missing safety model; it is that safety is opt-in by user phrasing.

Defenses

  • Make safety the default, not a prompt-triggered mode. Agents should run risk checks on every task involving external resources or money, whether or not the user asks. Do not rely on users to say “be careful”: even under a soft safety request, agents still bypassed checks up to 55% of the time.
  • Treat user-forwarded content as untrusted, not as user intent. Quoted posts, pasted links and “follow these steps” payloads should keep external-data trust level even when they arrive in the user’s message. Resist the instruction-hierarchy escalation that promotes them.
  • Verify URLs properly. Apply Unicode/IDN normalization, check provenance, and compare the full registered domain exactly (a matching prefix or a look-alike name is not enough). Never assert “official” or “verified” without an actual check. Overconfident reassurance is itself part of the attack.
  • Gate execution on necessity. Add a least-action stopping rule: each click, submit or download should be justified by the stated task. Stop at task completion instead of exhausting every interactive element on the page.
  • Verify backend state, not just front-end signals. Confirm whether a submission actually succeeded before retrying, so spoofed “failure” messages cannot drive silent re-submission and exfiltration.
  • Build user-side defenses. This is the channel currently left undefended. Warn users when their forwarded content contains links or directives, and surface what the agent is about to act on before it acts.
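Two of the defenses above, the least-action stopping rule and backend-state verification, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not an implementation from the study: `Action`, `TaskGate` and `submit_once` are hypothetical names, and `already_recorded` stands in for whatever authoritative check a real deployment has (for example an order-status lookup), which is assumed to exist.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Action:
    kind: str                # e.g. "click", "submit", "download"
    target: str              # hypothetical element identifier
    justification: str = ""  # which step of the stated task this serves


@dataclass
class TaskGate:
    """Least-action gate: permit only justified, in-scope actions."""
    allowed_kinds: set[str]
    done: bool = False
    log: list[str] = field(default_factory=list)

    def permit(self, action: Action) -> bool:
        if self.done:
            self.log.append(f"blocked {action.kind}: task already complete")
            return False
        if action.kind not in self.allowed_kinds:
            self.log.append(f"blocked {action.kind}: outside task scope")
            return False
        if not action.justification.strip():
            self.log.append(f"blocked {action.kind}: no justification")
            return False
        return True


def submit_once(
    submit: Callable[[], None],
    already_recorded: Callable[[], bool],
    max_attempts: int = 2,
) -> int:
    """Submit, re-checking authoritative state before every attempt.

    The page's own "failure" banner is never consulted, so a spoofed
    message cannot trigger a silent re-submission. Returns the number
    of submissions actually sent.
    """
    sent = 0
    for _ in range(max_attempts):
        if already_recorded():
            break
        submit()
        sent += 1
    return sent


if __name__ == "__main__":
    gate = TaskGate(allowed_kinds={"click", "submit"})
    print(gate.permit(Action("click", "#book", "open the booking form")))
    print(gate.permit(Action("download", "#installer", "page asked for it")))

    records: list[str] = []
    count = submit_once(lambda: records.append("order"), lambda: bool(records))
    print(count)
```

Setting `gate.done = True` once the stated objective is met makes every later action fail closed, which is the stopping behavior the study found missing.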

Status

Item                        Detail
Source                      arXiv 2601.10758 (Jan 2026)
Scope                       12 commercial agents: 6 trip-planning, 6 web-use
Core finding                Safety is conditional on user phrasing, not default
Default bypass              >92% (planning); up to 100% (web-use risky actions)
Under soft safety request   Bypass still up to ~55%
Class                       User-mediated injection / instruction-source escalation
Disclosure                  Academic study; behaviors measured in sandbox, no live harm

The durable framing is that the instruction hierarchy can be turned against itself. Ranking the user above external data is the right call for normal use — but it means an attacker who reaches the user, rather than the agent, inherits the user’s trust level. As long as safety checks are something a user has to ask for, the most polite and helpful agent is also the most exploitable.
