system: OPERATIONAL
← back to all hacks
PROMPT INJECTION MEDIUM NEW

ASPI: asking the user to clarify widens the injection surface

A May 17, 2026 arXiv benchmark shows that when an agent pauses to ask the user for clarification, prompt-injection success climbs from under 2% to over 34% on o3 and Gemini-3-Flash.

2026-06-03 // 6 min affects: o3, gemini-3-flash, llm-agents

What is this?

On May 17, 2026, a Scale AI team (Udari Madhushani Sehwag, Zhengyang Shan, Heming Liu, Dileepa Lakshan, Joseph Brandifino and Max Fenkell) posted ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents on arXiv (2605.17324, cs.CR). The finding is uncomfortable because it implicates a behavior the whole industry treats as good hygiene: when a task is underspecified, a well-built agent should stop and ask the user what they meant before acting.

ASPI — Ambiguous-State Prompt Injection — is a benchmark of 728 task-attack scenarios that isolates “asking for clarification” as a distinct agent state and measures whether entering that state changes how easily the agent is hijacked. The answer, across ten frontier models, is yes, and by a lot. The data and harness are public at github.com/scaleapi/aspi.

How it works

The benchmark compares the same scenario under two matched settings. In the execution setting, the agent receives a fully specified instruction and only meets adversarial content indirectly, through data a tool returns. In the clarification setting, the instruction is underspecified, so the agent must first ask the user a question and fold the reply back into its plan before acting. Everything else is held constant — same task, same injected content, same tools — so any difference in attack success is attributable to the state transition itself.

Setting          Agent flow                                    Adversarial entry point
---------------  --------------------------------------------  -----------------------------
Execution        instruction -> act -> tool data               tool-returned content
Clarification    instruction -> ASK USER -> incorporate -> act  clarification interface + data

The measured gap is large. Attack success rises from 1.8% to 34.0% on o3 and from 2.2% to 35.7% on Gemini-3-Flash, with the same direction of effect across the rest of the ten models tested. A decomposition analysis splits the cause in two. There is a state-dependent shift: once the model is in “I’m resolving ambiguity” mode, it processes incoming content more credulously, treating instruction-like text as something to act on rather than data to scrutinize. And there is a channel-specific effect: the clarification reply is a second, agent-solicited input path that arrives pre-blessed as “the user answering my question,” which is a weaker boundary than tool output the agent already distrusts. The paper deliberately stops at characterizing the surface; it ships a benchmark, not a weaponized payload.

Why it matters

Most agent security evaluation is run in the execution setting — fully specified task, single adversarial channel — and ASPI’s core claim is that this systematically underestimates the real attack surface of interactive agents. Robustness on a clean, fully specified task does not transfer to robustness once the agent starts a back-and-forth with a user, which is exactly the mode production assistants spend much of their time in.

This connects to a broader theme running through the June 2026 agent-security literature: agents are brittle precisely at their interaction seams. Adversa AI’s June 1, 2026 roundup groups ASPI alongside work arguing that data-and-instruction separation may be fundamentally hard. The practical reading is that clarification turns are a privileged channel — and any privileged channel an attacker can influence becomes a target. If injected content can shape what the user is asked, or ride along in what the user pastes back, the agent meets it in its most suggestible state.

Defenses

Four mitigations follow directly from the paper’s framing, even though ASPI itself prescribes none.

  1. Evaluate agents in the clarification state, not only in execution. Add underspecified-task variants to your red-team suite. A model that passes a fully specified injection benchmark may still fail once it is mid-dialogue, and you will not see that on an execution-only leaderboard.
  2. Treat the clarification reply as untrusted input. The user’s answer is not a trusted control channel just because the agent asked for it. Run it through the same instruction-stripping, provenance tagging, and policy checks you apply to tool output.
  3. Keep the action policy fixed across state transitions. Decisions about scope, tool access, and irreversibility should not loosen because the agent moved into “resolving ambiguity” mode. Re-confirm high-impact actions against the original, pre-clarification objective.
  4. Prefer constrained clarification over free text. Where feasible, resolve ambiguity with bounded choices (pick one of N) rather than an open reply that can smuggle instructions, narrowing the channel the paper identifies.

Status

ItemReferenceDateNotes
ASPI paperarXiv:2605.17324 (cs.CR, cs.AI)2026-05-17728 scenarios, 10 frontier models, matched execution vs. clarification
Headline resulto3 1.8% → 34.0%; Gemini-3-Flash 2.2% → 35.7%2026-05-17Clarification state amplifies attack success
Data + harnessgithub.com/scaleapi/aspi2026-05Public benchmark for reproduction
ContextAdversa AI agentic-security roundup2026-06-01Lists ASPI under agent vulnerabilities

ASPI does not describe a patchable bug in one product; it describes a property of how today’s agents handle a state they are designed to enter often. The useful takeaway is narrow and actionable: if your agent ever asks a user “what did you mean?”, your security testing has to ask the same question back.

Sources