DEFENSE MEDIUM

Output filtering beats model self-defense: 20,000 adaptive attacks, one survivor

Posted April 26 and revised May 12, 2026, a Swept AI / Michigan paper pitted nine prompt-injection defenses against an adaptive attacker. Every model-side defense eventually broke. Application-side output filtering held — zero leaks across 15,000 attacks.

2026-05-22 // 6 min // affects: gpt-4, claude-3, gemini, open-source-llms, system-prompt-secrets

What is this?

On April 26, 2026, Priyal Deep and co-authors at Swept AI and the University of Michigan posted “Evaluation of Prompt Injection Defenses in Large Language Models” on arXiv (2604.23887). The paper was revised on May 12, 2026 (v2). Its method is brutal in the right way: an adaptive attacker that evolves over hundreds of rounds, tested against nine defense configurations, across 20,000+ attacks. The finding lands hard: every defense that relies on the model to police itself eventually breaks. The only configuration that held was output filtering: hardcoded rules running in separate application code that inspects the model’s response before it reaches the user. Across 15,000 attacks against that configuration, the leak count was zero.

The result is not “another jailbreak paper”. It is a systematic answer to a question teams keep asking: which of the defenses we’re already shipping actually survive an attacker who adapts?

How it works

The threat model assumes a hostile user, an LLM application that holds a secret in its system prompt (an API key, an internal instruction, a customer record), and a defender free to combine any of the standard mitigations. The attacker is not a script with a fixed payload. It is itself an LLM-driven loop that observes the response, refines its strategy, and tries again. The paper runs this loop for hundreds of rounds against each defense and reports the cumulative leak rate.
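The loop described above can be sketched as a small evaluation harness. This is a toy under stated assumptions, not the paper's harness: the names `run_campaign`, `stub_attacker`, `brittle_app` and `filtered_app` are hypothetical, the attacker only emits placeholder probe strings, and "leak rate" is assumed here to mean the fraction of rounds whose response contains the secret verbatim.

```python
from typing import Callable

SECRET = "sk-test-0000"  # hypothetical placeholder, not a real credential

History = list[tuple[str, str]]


def leaked(response: str, secret: str = SECRET) -> bool:
    """Assumed leak criterion: the secret appears verbatim in the response."""
    return secret in response


def run_campaign(
    respond: Callable[[str], str],
    next_prompt: Callable[[History], str],
    rounds: int,
) -> float:
    """Run an adaptive loop and return the fraction of rounds that leaked."""
    history: History = []
    leaks = 0
    for _ in range(rounds):
        prompt = next_prompt(history)   # attacker sees every earlier exchange
        response = respond(prompt)      # defended application under test
        history.append((prompt, response))
        if leaked(response):
            leaks += 1
    return leaks / rounds


def stub_attacker(history: History) -> str:
    """Stand-in for the LLM-driven attacker: just numbers its attempts."""
    return f"probe-{len(history)}"


def brittle_app(prompt: str) -> str:
    """Stand-in for a model-side defense that holds at first, then breaks."""
    attempt = int(prompt.split("-")[1])
    return SECRET if attempt >= 3 else "I cannot share that."


def filtered_app(prompt: str) -> str:
    """The same brittle app, wrapped in an application-side output check."""
    raw = brittle_app(prompt)
    return "[withheld]" if SECRET in raw else raw


print(run_campaign(brittle_app, stub_attacker, rounds=10))
print(run_campaign(filtered_app, stub_attacker, rounds=10))
```

Keeping the leak check outside both the attacker and the defender lets the same harness score any defense configuration without trusting the model's own account of what it emitted.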

Nine defense configurations were tested. They fall into two families:

# Conceptual taxonomy — illustrative, not an exploit.
# A. Model-side defenses (the model is asked to refuse)
#    1. Plain system prompt with "do not reveal the secret"
#    2. Spotlighting / delimiters around untrusted input
#    3. Instruction hierarchy prompts
#    4. Self-reflection: "Did your last output contain the secret?"
#    5. Paraphrase / sandwich defenses
#    6. Prompt-level classifier (model decides if input is malicious)
#    7. Combined model-side stack (several of the above together)
#
# B. Application-side defenses (code outside the model)
#    8. Input filter via separate classifier
#    9. OUTPUT FILTER: hardcoded regex/string checks on the response
#       before it is delivered to the user — zero leaks in 15,000 attacks

The asymmetry is the point. A defense in family A asks the very system being attacked to decide whether the attack worked. Given enough rounds, the adaptive attacker found a phrasing the defender’s own model accepted. Family B treats the model as an untrusted component and validates its output independently. The output filter in the paper is not clever AI — it is deterministic application code that knows what shape the secret takes and refuses to emit responses that contain it.
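A deterministic output filter of the kind described above might look like the following minimal sketch. The secret format (`sk-` plus 16 or more alphanumerics), the placeholder value in `KNOWN_SECRETS`, and the function names are all assumptions for illustration; the paper's actual rules are not reproduced here.

```python
import re

# Assumed secret shape for illustration: "sk-" followed by 16+ alphanumerics.
SECRET_PATTERN = re.compile(r"sk-[A-Za-z0-9]{16,}")

# Placeholder value; a real deployment would load known secrets from config.
KNOWN_SECRETS = ("sk-EXAMPLEEXAMPLE0001",)

REFUSAL = "Sorry, I can't help with that."


def _squash(text: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def filter_output(response: str) -> str:
    """Return the response unchanged, or a fixed refusal if it looks like a leak."""
    # 1. Exact match against known secret values.
    if any(secret in response for secret in KNOWN_SECRETS):
        return REFUSAL
    # 2. Shape match: anything that looks like a key, known or not.
    if SECRET_PATTERN.search(response):
        return REFUSAL
    # 3. Separator-insensitive match, to catch spaced-out or dashed copies.
    squashed = _squash(response)
    if any(_squash(secret) in squashed for secret in KNOWN_SECRETS):
        return REFUSAL
    return response


print(filter_output("The key is sk-EXAMPLEEXAMPLE0001"))
print(filter_output("Hello, how can I help?"))
```

The filter never asks the model anything: it runs after generation, in code the attacker cannot talk to. This sketch only covers verbatim, shape-based and separator-stripped matches; other transformations of the secret would need their own rules.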

Why it matters

This is the third independent paper in roughly two years pointing the same direction. The Instruction Hierarchy (OpenAI, 2024) admitted the model alone cannot enforce a hierarchy under adversarial pressure. ARGUS (Weng et al., May 5, 2026) moved the decision to a provenance graph outside the model. Deep et al. now generalise the negative result: any defense that ends in “…and the model decides” eventually loses to an adaptive attacker. The corollary is constructive: security boundaries belong in code that does not run inside the LLM.

For teams shipping LLM applications today, this reframes what “prompt injection mitigation” should mean on a roadmap. Spotlighting, sandwiching, instruction-hierarchy prompts and self-reflection are useful for clean-traffic UX, but they are not the boundary. The boundary is whatever runs after the model returns — and before the response touches a tool call, a database write, or the user’s screen.
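For the tool-call side of that boundary, one possible shape is an allowlist check that runs before anything executes. This is a sketch under assumptions: the tool names, the `{"tool": ..., "args": ...}` JSON layout, and `RejectedToolCall` are hypothetical and not taken from the paper.

```python
import json

# Hypothetical allowlist: tool name -> required argument names and types.
ALLOWED_TOOLS: dict[str, dict[str, type]] = {
    "lookup_order": {"order_id": str},
    "get_weather": {"city": str},
}


class RejectedToolCall(Exception):
    """Raised when a model-proposed tool call fails validation."""


def validate_tool_call(raw: str) -> tuple[str, dict]:
    """Parse and validate a model-proposed tool call before it is executed."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise RejectedToolCall("not valid JSON") from exc
    if not isinstance(call, dict):
        raise RejectedToolCall("expected a JSON object")

    name = call.get("tool")
    args = call.get("args")
    if not isinstance(name, str) or name not in ALLOWED_TOOLS:
        raise RejectedToolCall(f"tool not allowed: {name!r}")

    schema = ALLOWED_TOOLS[name]
    # Exactly the expected argument names: no extras, none missing.
    if not isinstance(args, dict) or set(args) != set(schema):
        raise RejectedToolCall("unexpected or missing arguments")
    for key, expected_type in schema.items():
        if not isinstance(args[key], expected_type):
            raise RejectedToolCall(f"bad type for argument {key!r}")
    return name, args


print(validate_tool_call('{"tool": "lookup_order", "args": {"order_id": "A1"}}'))
```

Anything the model proposes outside the allowlist raises before it reaches a tool, a database, or the user, regardless of how the model was persuaded to propose it.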

Defenses

Practical implications drawn directly from the paper and from concurrent work cited above:

  1. Never put a secret in the system prompt that you would not also publish. If the secret were leaked, what would break? If the answer is “a lot”, move it out of the prompt entirely and behind a tool the model can call with scoped permissions.
  2. Implement deterministic output filters. Regex, string match, structural JSON validation, denylists of known sensitive tokens. Run them in application code before delivery. Treat any match as a hard refusal, not a warning.
  3. Stack input filtering and output filtering. Input filtering reduces the attack surface; output filtering catches what slipped through. The paper’s combined configurations performed best when filtering existed on both sides.
  4. Audit any “self-check” defense. If the mitigation involves the same model asking itself “was that safe?”, assume an adaptive attacker can defeat it. Use it for UX, not for security.
  5. Restrict sensitive-operation agents to trusted personnel until your output-side defenses have been verified independently. The authors explicitly frame this as an interim posture.
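Items 2 and 3 above can be combined into a single request path. The following is a minimal sketch, assuming the model is any callable from string to string; the denylist entries, `handle`, and `leaky_model` are hypothetical placeholders chosen for illustration.

```python
from typing import Callable

# Illustrative denylists; real ones would be maintained per application.
INPUT_DENYLIST = ("ignore previous instructions", "reveal your system prompt")
OUTPUT_DENYLIST = ("INTERNAL-ONLY", "acct-000-PLACEHOLDER")

REFUSAL = "Request blocked."


def input_ok(text: str) -> bool:
    """Cheap pre-filter: reduces attack surface, but is not the boundary."""
    lowered = text.lower()
    return not any(phrase in lowered for phrase in INPUT_DENYLIST)


def output_ok(text: str) -> bool:
    """Deterministic post-filter: the last check before delivery."""
    return not any(token in text for token in OUTPUT_DENYLIST)


def handle(user_input: str, model: Callable[[str], str]) -> str:
    """Run input filter, model, then output filter; any match is a hard refusal."""
    if not input_ok(user_input):
        return REFUSAL              # blocked before the model is called
    response = model(user_input)
    if not output_ok(response):
        return REFUSAL              # nothing partial or redacted is delivered
    return response


def leaky_model(prompt: str) -> str:
    """Stand-in model that leaks a sensitive token on some benign-looking input."""
    if "notes" in prompt:
        return "Notes: INTERNAL-ONLY roadmap"
    return f"You said: {prompt}"


print(handle("Please ignore previous instructions", leaky_model))
print(handle("show me the notes", leaky_model))
print(handle("hello", leaky_model))
```

The second call shows why the output side matters: the input looks harmless and passes the pre-filter, yet the response is still stopped before delivery.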

Status

Item | Reference | Date | Notes
Paper (v1) | arXiv:2604.23887v1 | 2026-04-26 | 14 pages, 9 figures, cs.CR / cs.AI
Paper (v2, current) | arXiv:2604.23887v2 | 2026-05-12 | Latest revision
Defenses tested | §3 of the paper | 2026-04-26 | 9 configurations, model-side and application-side
Attack budget | §4 | | 20,000+ adaptive attacks, hundreds of rounds
Best defense | Output filtering | | 0 leaks across 15,000 attacks
Related: ARGUS | arXiv:2605.03378 | 2026-05-05 | Same direction: defend outside the model

The deeper takeaway: prompt injection is not a model problem to be solved by a better model. It is a systems problem, and like every other systems problem in security, the answer is to validate untrusted output in code you control.

Sources