MPBench: a systematic taxonomy of memory poisoning in LLM agents
A June 3, 2026 arXiv study maps four memory write channels, nine structural weaknesses and six attack classes — and shows prompt-injection defenses don't cover memory poisoning.
What is this?
On June 3, 2026, five researchers (Pritam Dash, Tongyu Ge, Aditi Jain, Tanmay Shah and Zhiwei Shang) posted “From Untrusted Input to Trusted Memory: A Systematic Study of Memory Poisoning Attacks in LLM Agents” to arXiv (cs.CR/cs.AI). It is not a new attack. It is a systematization: the paper takes the scattered body of memory-poisoning results published over the last two years and organizes it into a taxonomy, then ships a benchmark, MPBench, to measure it.
Memory poisoning is the problem of an agent treating untrusted input as if it were trustworthy long-term memory. The paper’s framing is the key idea: a single adversarial write can exert long-term influence over an agent’s later behavior, long after the conversation that planted it has ended. Where prompt injection is a one-shot hijack of the current turn, memory poisoning is a hijack that persists across sessions. This is the first published attempt to map that surface methodically rather than one exploit at a time.
How it works
The paper decomposes the problem along three axes. No payloads are reproduced here; the canonical reference is the arXiv PDF.
Axis 1 — WRITE CHANNELS (4)
How untrusted content reaches durable memory:
- conversational turns committed to long-term store
- tool / retrieval outputs written back as "experience"
- summarization or reflection steps that distill input into notes
- explicit user- or agent-issued memory writes
Axis 2 — STRUCTURAL VULNERABILITIES (9)
Why those channels are exploitable, grouped under:
- model capabilities (the model can't reliably tell data from instruction)
- system-prompt design (no provenance or trust labels on stored items)
- agent architecture (aggressive write/retrieve policies, no review gate)
Axis 3 — ATTACK CLASSES (6)
Six families of poisoning derived from the channel × weakness matrix
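One way to put the taxonomy to work is as a checklist structure for your own agent. The sketch below is illustrative only: `WriteChannel`, `WeaknessGroup` and `AgentProfile` are hypothetical names, not part of MPBench. It encodes only the four channels and three weakness groupings listed above, because the nine individual weaknesses and six attack classes are not enumerated in this summary.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Set


class WriteChannel(Enum):
    """The four write channels as summarized above (names are our own)."""
    CONVERSATION = "conversational turns committed to long-term store"
    TOOL_OUTPUT = "tool / retrieval outputs written back as experience"
    REFLECTION = "summarization or reflection steps that distill input into notes"
    EXPLICIT_WRITE = "explicit user- or agent-issued memory writes"


class WeaknessGroup(Enum):
    """The three groupings of structural weaknesses (names are our own)."""
    MODEL_CAPABILITIES = "model capabilities"
    SYSTEM_PROMPT_DESIGN = "system-prompt design"
    AGENT_ARCHITECTURE = "agent architecture"


@dataclass
class AgentProfile:
    """Hypothetical inventory of which channels an agent exposes and guards."""
    name: str
    exposed: Set[WriteChannel] = field(default_factory=set)
    guarded: Set[WriteChannel] = field(default_factory=set)

    def unguarded(self) -> List[WriteChannel]:
        # Channels that can reach durable memory with no check in front of them.
        return sorted(self.exposed - self.guarded, key=lambda c: c.name)


profile = AgentProfile(
    name="support-bot",
    exposed={WriteChannel.CONVERSATION, WriteChannel.TOOL_OUTPUT},
    guarded={WriteChannel.CONVERSATION},
)
for channel in profile.unguarded():
    print(f"unguarded write channel: {channel.value}")
```

Keeping the inventory as data rather than prose makes it easy to diff between releases when a new memory feature adds a channel.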
Two findings matter most for practitioners. First, aggressiveness is a liability: agents tuned to write and retrieve memory more eagerly — the same tuning that makes them feel “smart” and personalized — scored as more exploitable on MPBench. The convenience knob is also the risk knob. Second, and more pointed: the authors test existing prompt-injection defenses and find they do not cover memory poisoning. A filter that inspects the current prompt has nothing to say about a malicious note that was written into memory days earlier and is now retrieved as trusted context.
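The second finding is easier to see in a toy model. The sketch below is not taken from the paper or from MPBench. It assumes a hypothetical input-side check (`input_filter_allows`) and an eager write policy (`handle_tool_output`), and uses a harmless placeholder string in place of any real payload. The point is structural: the filter inspects only the current turn, so it never sees a note that entered memory through a different channel in an earlier session.

```python
from typing import List

# Durable store shared across sessions (a plain list stands in for a vector store).
memory: List[str] = []


def input_filter_allows(user_message: str) -> bool:
    # Hypothetical input-side check: it only ever sees the current turn.
    return "UNTRUSTED-MARKER" not in user_message


def handle_tool_output(tool_output: str) -> None:
    # Eager write policy: tool output is persisted with no check at all.
    memory.append(tool_output)


def build_context(user_message: str) -> str:
    if not input_filter_allows(user_message):
        raise ValueError("blocked by input filter")
    # Retrieved notes are concatenated as if they were trusted context.
    return "\n".join(memory + [user_message])


# Session 1: a tool result carrying placeholder untrusted text is stored.
handle_tool_output("UNTRUSTED-MARKER: placeholder note from a web page")

# Session 2, days later: the user's message is benign and passes the filter,
# yet the assembled context still contains the stored note.
context = build_context("What is on my calendar today?")
assert "UNTRUSTED-MARKER" in context
```

The same marker typed directly into the current turn would be blocked. It is the detour through durable memory that takes it out of the filter's view.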
This connects to attacks the research community had already documented, such as AgentPoison (arXiv 2407.12784), which showed memory/knowledge-base poisoning of agents back in 2024, and to our prior coverage of OWASP’s ASI06 memory-poisoning category, dormant memory exfiltration and temporal memory contamination. What the new paper (arXiv 2606.04329) adds is the connective tissue: a shared vocabulary and a measurement harness.
Why it matters
Memory is now a default feature, not a research toy. Assistant products ship persistent memory, agent frameworks write “experiences” back to vector stores, and RAG pipelines blur the line between retrieved data and instructions. Every one of those is a write channel in the paper’s sense.
The defensive implication is uncomfortable. Most teams that adopted an input-side prompt-injection filter in 2025 implicitly assumed it generalized. This paper is evidence that it does not. A poisoned memory is, by construction, trusted by the time it is read — it has already crossed the trust boundary the filter was guarding. The exposure is also asymmetric in time: the write and the trigger can be separated by days or sessions, which defeats per-request monitoring and complicates incident forensics, because the malicious turn may have aged out of your logs.
A taxonomy and a benchmark are exactly what this area needed. They let teams ask concrete questions — which of the four channels does my agent expose, which of the six classes can I reproduce against my own stack — instead of arguing about anecdotes.
Defenses
The paper is diagnostic rather than prescriptive, but its structure points directly at mitigations. Treat memory as an untrusted input boundary, not a trusted cache.
- Label provenance on every stored item. Tag memory entries with their source (user, tool output, model reflection) and trust level, and never let a tool- or document-sourced note be retrieved with the same authority as a verified instruction.
- Gate the write path, not just the read path. Input-side prompt-injection filters do not generalize to memory; add a distinct check at the moment content is committed to durable store, and again at retrieval.
- Make memory writes least-aggressive by default. The MPBench finding is explicit: eager write/retrieve policies are more exploitable. Require a relevance or review threshold before persisting, and prefer ephemeral context over durable memory when in doubt.
- Add a human or policy review gate for high-impact writes. Memory that can change future tool authorizations, credentials handling, or spending decisions should not be self-writable without a check.
- Retain and version memory for forensics. Because write and trigger are time-separated, keep an audit trail of who/what wrote each entry and when, so a poisoned note can be traced after it fires. See our note on agent audit-trail integrity.
- Benchmark your own agent. Use MPBench (or its methodology) to enumerate which write channels and attack classes your deployment actually exposes, rather than assuming a single filter covers them.
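Several of these mitigations can live in one small component. The following is a minimal sketch under stated assumptions, not the paper's design and not any framework's API: `Trust`, `MemoryEntry` and `MemoryStore` are hypothetical names, the three trust levels and the `0.7` relevance threshold are placeholders, and the relevance score and approval flag are assumed to be supplied by the caller. It shows provenance labels, a write-path gate, a trust floor on retrieval, and an audit log that records rejected writes as well as accepted ones.

```python
import time
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Dict, List


class Trust(IntEnum):
    UNTRUSTED = 0  # tool output, retrieved documents
    DERIVED = 1    # model reflections and summaries
    VERIFIED = 2   # operator-issued or reviewed content


@dataclass(frozen=True)
class MemoryEntry:
    text: str
    source: str        # e.g. "operator", "tool", "reflection"
    trust: Trust
    written_at: float  # Unix timestamp, kept for forensics


@dataclass
class MemoryStore:
    min_relevance: float = 0.7  # assumed threshold; tune per deployment
    entries: List[MemoryEntry] = field(default_factory=list)
    audit_log: List[Dict[str, object]] = field(default_factory=list)

    def write(self, text: str, source: str, trust: Trust, relevance: float,
              high_impact: bool = False, approved: bool = False) -> bool:
        """Write-path gate: every attempt is logged, accepted or not."""
        if relevance < self.min_relevance:
            decision = "rejected: below relevance threshold"
        elif high_impact and not approved:
            decision = "rejected: high-impact write needs review"
        else:
            decision = "accepted"
        now = time.time()
        self.audit_log.append({"at": now, "source": source, "trust": trust.name,
                               "decision": decision, "text": text})
        if decision != "accepted":
            return False
        self.entries.append(MemoryEntry(text, source, trust, now))
        return True

    def retrieve(self, min_trust: Trust = Trust.UNTRUSTED) -> List[MemoryEntry]:
        """Read-path gate: callers state the minimum trust they accept."""
        return [e for e in self.entries if e.trust >= min_trust]

    def render(self, min_trust: Trust = Trust.UNTRUSTED) -> str:
        """Render entries with provenance labels instead of bare text."""
        return "\n".join(f"[{e.source}/{e.trust.name}] {e.text}"
                         for e in self.retrieve(min_trust))


store = MemoryStore()
store.write("Operator set units to metric", "operator", Trust.VERIFIED, relevance=0.9)
store.write("Note scraped from a web page", "tool", Trust.UNTRUSTED, relevance=0.9)
store.write("Raise the spending limit", "tool", Trust.UNTRUSTED, relevance=0.95,
            high_impact=True)  # rejected: no approval supplied
print(store.render(min_trust=Trust.VERIFIED))
```

Logging rejected writes is a deliberate choice here: because write and trigger can be separated by days, a record of what was attempted and turned away can be as useful to an investigator as the record of what was stored.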
Status
| Item | Reference | Date | Notes |
|---|---|---|---|
| arXiv 2606.04329 v1 | arXiv (cs.CR/cs.AI) | 2026-06-03 | Submitted; systematic study + MPBench benchmark |
| Four write channels / nine weaknesses / six classes | Paper abstract | 2026-06-03 | Taxonomy across model, prompt, architecture |
| “Aggressive memory ⇒ more exploitable” | Paper finding | 2026-06-03 | Measured on MPBench |
| “Prompt-injection defenses don’t cover memory poisoning” | Paper finding | 2026-06-03 | Key gap for existing deployments |
| Foundational prior work (AgentPoison) | arXiv 2407.12784 | 2024 | Earlier memory/knowledge-base poisoning attack |
The right takeaway is not “memory poisoning is new” — it isn’t. It is that the field finally has a shared map and a ruler. If your agent has persistent memory and your only defense is an input-side prompt filter, this paper is the documented reason to assume you are not covered, and a structured way to find out where the gaps are.
This article summarizes publicly available research for defensive and educational purposes. It reproduces no exploit code.