PISmith: adaptive RL red-teaming keeps breaking injection defenses
A March 2026 paper trains an attacker model with reinforcement learning to stress-test prompt-injection defenses in a black-box setting — and 8 state-of-the-art defenses still fall, including on AgentDojo and InjecAgent.
What is this?
PISmith is a reinforcement-learning red-teaming framework, published on arXiv (2603.13026) in March 2026 by researchers at The Pennsylvania State University. Its purpose is defensive: it measures how robust today’s prompt-injection defenses really are when an attacker is allowed to adapt rather than replay a fixed list of payloads.
The result is blunt. Across 13 evaluation benchmarks and 8 published defenses, including filter-based detectors and prevention-trained models, PISmith finds that “state-of-the-art prompt injection defenses remain vulnerable to adaptive attacks.” This extends the central finding of the October 2025 paper The Attacker Moves Second (Nasr, Carlini, Tramèr et al., arXiv:2510.09023), which bypassed 12 defenses, most of them with attack success rates above 90%, even though the majority had originally reported near-zero rates. PISmith turns that one-off demonstration into a reusable, automated training loop.
How it works
PISmith treats prompt injection as a policy-learning problem. An attack LLM is trained with on-policy reinforcement learning to generate injected prompts, given only black-box access to the defended system — it can query the target and observe outputs, nothing more. This mirrors a realistic adversary who cannot see model weights or the defense’s internals.
The paper’s contribution is in making this training actually converge. Applying standard GRPO (the group-relative policy optimization popularized by DeepSeek) fails against strong defenses because of reward sparsity: almost every generated prompt is blocked, so the few successes are drowned out and the policy’s entropy collapses — it stops exploring before it finds a working strategy. PISmith adds two mechanisms to counter this:
- Adaptive entropy regularization — an entropy bonus that activates only when policy entropy drops below a cap, sustaining exploration without degenerating into random, incoherent text.
- Dynamic advantage weighting — amplifying the gradient contribution of rare successful rollouts in proportion to how rare they are, so scarce wins are not diluted by the mass of failures.
No exploit string is reproduced here, and none is needed to understand the lesson: the method is a general optimization recipe, not a specific payload, which is precisely why static defenses do not hold against it.
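To make the optimization problem concrete, here is a minimal numerical sketch of the reward-sparsity issue and of the two mechanisms described above. It is not the paper's implementation: the function names, the inverse-success-rate weighting, the clipping value `max_weight`, and the on/off entropy coefficient are all assumptions chosen for illustration, and the inputs are plain reward numbers rather than any model or prompt.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: each reward minus the group mean, over the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def rarity_weighted_advantages(rewards, max_weight=10.0, eps=1e-8):
    """Assumed form of dynamic advantage weighting (hypothetical, not from the paper).

    Successful rollouts have their advantage multiplied by the inverse of the
    group's success rate, clipped at max_weight, so rarer wins count for more.
    """
    adv = group_relative_advantages(rewards, eps)
    success_rate = mean([1.0 if r > 0 else 0.0 for r in rewards])
    if success_rate == 0.0:
        return adv  # no successes in this group, so there is nothing to amplify
    weight = min(max_weight, 1.0 / success_rate)
    return [a * weight if r > 0 else a for a, r in zip(adv, rewards)]


def entropy_bonus_coeff(entropy, entropy_cap, base_coeff=0.01):
    """Assumed form of adaptive entropy regularization (hypothetical).

    The bonus is switched on only while policy entropy is below the cap,
    and is zero once entropy is at or above it.
    """
    return base_coeff if entropy < entropy_cap else 0.0


if __name__ == "__main__":
    all_blocked = [0.0] * 16          # every rollout blocked by the defense
    one_win = [0.0] * 15 + [1.0]      # a single success in a group of 16

    # With no successes, every advantage is zero: the group carries no signal.
    print(group_relative_advantages(all_blocked))

    # With one success, the weighted variant boosts that rollout's advantage.
    print(group_relative_advantages(one_win)[-1])
    print(rarity_weighted_advantages(one_win)[-1])
```

The all-zero case shows why sparse rewards stall training: a group in which every rollout is blocked contributes no gradient at all. How the real method schedules the entropy coefficient and scales advantages is specified in the paper, not here.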
Why it matters
The paper exposes a structural tension rather than a single bug: defenses “cannot effectively maintain high utility in benign settings while simultaneously resisting adaptive attacks.” Tighten the filter and legitimate tasks break; loosen it and the adaptive attacker gets through.
This matters most for agents. PISmith was also evaluated in agentic settings on InjecAgent and AgentDojo, succeeding against both open-source and closed-source models (the paper names GPT-4o-mini and GPT-5-nano as targets). Those are the exact tool-using, document-reading configurations that production agents ship today. A defense that scores well on a fixed benchmark can still be defeated by an attacker who trains specifically against it — meaning a vendor’s “near-zero ASR” claim says little unless it was measured adaptively.
The practical takeaway aligns with where the field has landed in 2026: prompt injection has no reliable model-side fix yet, so robustness claims must be earned against strong, adaptive evaluation — not static test sets.
Defenses
PISmith is itself a defensive tool. The right response is to adopt this class of evaluation, then constrain the agent's architecture rather than trust a filter.
- Evaluate adaptively. Treat any “near-zero attack success rate” reported only against static payloads as unverified. Re-test defenses with adaptive, optimization-driven attackers (RL-, search-, or human-guided) before relying on them.
- Do not depend on a single filter. Both filter-based detectors and prevention-trained models were broken in the study. Use them as one layer, never the only one.
- Apply the Rule of Two. Limit any agent session to at most two of {untrusted input, sensitive data/systems, state-change or external communication}. This bounds blast radius even when the injection succeeds.
- Quarantine untrusted content. Pass web pages, emails and tool output to the model as data, not as authoritative instructions; strip or tag retrieved text in RAG pipelines.
- Bind capabilities to the caller, with short-lived tokens, so a hijacked agent cannot act beyond its user’s scope.
- Keep humans in the loop for any action that is irreversible or externally visible when all three risky properties are unavoidable.
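The Rule of Two and the human-in-the-loop fallback from the list above can be expressed as a simple pre-flight check on a session's configuration. This is a minimal sketch under stated assumptions: `SessionProfile`, its field names, and `requires_human_approval` are hypothetical names invented for illustration, not part of any agent framework, and a real deployment would derive these flags from its actual tool and data permissions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionProfile:
    """Hypothetical description of what one agent session is allowed to do."""
    processes_untrusted_input: bool      # web pages, emails, tool output, ...
    accesses_sensitive_data: bool        # private data or sensitive systems
    changes_state_or_communicates: bool  # state changes or external messages

    def risky_property_count(self) -> int:
        # Booleans sum as 0 or 1, giving the number of risky properties held.
        return sum((
            self.processes_untrusted_input,
            self.accesses_sensitive_data,
            self.changes_state_or_communicates,
        ))


def satisfies_rule_of_two(profile: SessionProfile) -> bool:
    """True when the session holds at most two of the three risky properties."""
    return profile.risky_property_count() <= 2


def requires_human_approval(profile: SessionProfile) -> bool:
    """When all three properties are unavoidable, gate actions on a human."""
    return not satisfies_rule_of_two(profile)


if __name__ == "__main__":
    # Reads untrusted email and private data, but cannot send or modify anything.
    reader = SessionProfile(True, True, False)
    # Same session with the ability to send messages externally.
    sender = SessionProfile(True, True, True)
    print(satisfies_rule_of_two(reader), requires_human_approval(reader))
    print(satisfies_rule_of_two(sender), requires_human_approval(sender))
```

A check like this does not stop an injection from succeeding. It only limits what a hijacked session can do, which is the point of constraining architecture instead of relying on a filter.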
Status
| Item | Date | Status |
|---|---|---|
| The Attacker Moves Second (arXiv:2510.09023) | Oct 10, 2025 | Public |
| PISmith (arXiv:2603.13026) | Mar 2026 | Public, code released |
| Defenses tested (8) | — | Vulnerable to adaptive attacks |
| Agentic benchmarks (InjecAgent, AgentDojo) | — | Bypassed on open- and closed-source models |
PISmith does not introduce a new attack class — it operationalizes adaptive red-teaming as a repeatable benchmark. The actionable message for defenders is the same one The Attacker Moves Second delivered, now harder to ignore: a defense is only as strong as the strongest attacker it was tested against.