The Defense Trilemma: why prompt-injection wrappers can't be complete
A Lean 4-verified April 2026 proof shows no continuous, utility-preserving input wrapper can block every prompt injection. Continuity, utility, and completeness cannot all hold at once.
What is this?
On April 7, 2026, a team led by Manish Bhatt and Sarthak Munshi posted The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail? (arXiv:2604.06436, revised April 11). The paper does not describe a new attack. It proves a limit: no continuous, utility-preserving wrapper defense can make every output of a language model safe. A “wrapper” here is any defense that preprocesses the input before the model sees it — a sanitizer, a paraphraser, a guard prompt, a spotlighting or delimiter step, a constitutional rewriter.
Prompt injection sits at #1 in the OWASP Top 10 for LLM Applications, and the industry’s most common response is exactly this kind of wrapper. The trilemma says the popular approach has a mathematical ceiling: a defense can be continuous (similar prompts get similar treatment), utility-preserving (safe prompts pass through unchanged), and complete (every output is safe) — but you can only ever have two of the three. Unusually for security research, the full argument is machine-checked in Lean 4.
How it works
Model the model’s safety as a continuous score f(x) over the space of prompts, with a threshold τ: prompts below τ are safe, above it are unsafe, and exactly at τ is the decision boundary. A defense D remaps prompts before scoring. The three properties produce three results, each proven under a stronger hypothesis.
Boundary fixation (Theorem 4.1) is the kernel, and the intuition fits in a sentence. A utility-preserving defense leaves every safe prompt untouched, so its set of fixed points is closed; but the safe region is open (it is the preimage of an open interval under a continuous score). In a connected prompt space, a non-empty proper open set cannot also be closed — so the fixed points must spill onto the boundary itself. Translation: some prompts sitting exactly on the safe/unsafe line pass through the defense unchanged. The defense cannot clean the very inputs closest to the edge.
The ε-robust constraint (Theorem 5.1) extends this from a point to a region. Under a Lipschitz-regularity assumption (the defense can’t move scores arbitrarily fast), a whole positive-measure band of near-boundary prompts stays near-threshold after defense. You cannot uniformly push that neighborhood to safety.
The persistent unsafe region (Theorem 6.3) is the strongest: under a transversality condition — when the alignment surface rises faster than the defense can pull it down — a positive-measure set of inputs remains strictly unsafe after the defense runs. Not borderline; unsafe.
The authors frame the conclusion as a no-free-lunch result for defenses, analogous to Wolpert’s classic theorem that no optimizer dominates on all problems. They also extend the impossibility to multi-turn conversations, stochastic (randomized) defenses, and nonlinear agent pipelines, where they note that tool calls amplify the failure rather than contain it.
Why it matters
The result is a precise diagnosis of why the “sanitize the input” reflex keeps disappointing. Two of the trilemma’s three properties are things defenders actively want: continuity (so the system behaves predictably) and utility preservation (so legitimate users aren’t blocked). Holding both, completeness is mathematically out of reach — there will always be a positive-measure set of inputs the wrapper waves through. An attacker only needs that set to be non-empty.
This pairs with our coverage of contextual integrity and why data-instruction separation is a category mistake: one paper argues semantically that the data/instruction line is the wrong frame, the other proves topologically that the wrapper built on it cannot be complete. Both point the same way — patching the input stream is structurally insufficient.
The dates matter for how much weight to give this. The preprint is from April 7, 2026, and as a formal-methods claim its credibility rests on the proofs, not on peer-review status: the Lean 4 development reports 45 files, ~350 theorems, zero sorry (admitted) statements, and three standard axioms, with the artifact published openly. The empirical section validates the predictions on three models — Llama-3-8B, GPT-OSS-20B, and GPT-5-Mini — by measuring the predicted near-boundary failure bands.
A caution on scope, which the authors stress: the theorems constrain only a single continuous wrapper D: X → X on a connected prompt space. They are not a proof that LLMs cannot be defended.
Defenses
The paper’s value for defenders is that it tells you exactly which design space is dead and which is still open. Crucially, the impossibility does not cover:
- Training-time alignment — RLHF, DPO, constitutional-AI training that changes
fitself rather than wrapping its input. - Architectural changes to the model.
- Discontinuous defenses — hard blocklists and discrete classifiers that are allowed to reject, not remap. Giving up continuity is a legitimate escape from the trilemma.
- Output-side filters, ensembles, and human-in-the-loop review.
- Multi-component systems that reject or redirect inputs instead of preserving utility on every prompt — i.e. systems willing to refuse some legitimate-looking inputs.
The authors’ “engineering prescription” follows from the math: make the boundary shallow (reduce how confidently the model sits near τ), reduce the Lipschitz constant (limit how fast scores can swing), reduce the effective dimension of the prompt space the defense must cover, and monitor, don’t eliminate, the boundary — accept that a residual region exists and instrument it instead of pretending a wrapper closed it.
The practical takeaway aligns with defense-in-depth orthodoxy, now with a proof underneath it: do not stake safety on a single input sanitizer. Treat wrappers as one shallow layer, push real safety into training and architecture, and design agent pipelines on the assumption that some injected inputs will reach the model — so the damage of a successful injection is bounded by least-privilege tool scopes and output controls, not by the wrapper’s promise to catch everything.
Status
| Item | Detail |
|---|---|
| Paper | The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail? (arXiv:2604.06436) |
| Posted | 2026-04-07 (revised 2026-04-11) |
| Result | Continuity + utility preservation + completeness cannot coexist for a wrapper defense |
| Verification | Lean 4 + Mathlib: 45 files, ~350 theorems, zero sorry, three standard axioms |
| Empirical | Llama-3-8B, GPT-OSS-20B, GPT-5-Mini |
| Scope | Continuous wrappers on connected prompt spaces only; training-time and architectural defenses unaffected |