AGENTS MEDIUM NEW

MemPoison: backdooring agent memory through ordinary conversation

A May 2026 arXiv paper shows how to plant a triggerable backdoor in an LLM agent's long-term memory just by chatting with it, using a payload engineered to survive the selective extraction and rewriting stages meant to filter poisoned content.

2026-06-20 // 6 min // affects: llm-agents, agent-memory-systems, rag-pipelines

What is this?

The paper "Hijacking Agent Memory: Stealthy Trojan Attacks Through Conversational Interaction", posted to arXiv in May 2026, describes MemPoison: a way to plant a triggerable backdoor in an LLM agent’s long-term memory using nothing but ordinary conversation. The attacker needs no elevated privileges and no write access to the memory store. They talk to the agent, the same way any user would on a shared platform, and a dormant payload is left behind for later.

This is a refinement of a problem the field already knows. MINJA (NeurIPS 2025) showed that a user could inject malicious records into agent memory through queries alone, and the attack-and-defense study of January 2026 hardened the picture. What MemPoison adds is a direct answer to the obvious defense: many production agents do not store conversations verbatim. They run a selective memory pipeline that extracts, summarizes and rewrites content before anything is committed. Earlier attacks quietly assumed injected text lands in memory intact. MemPoison is built specifically to survive that pipeline — which is what makes it worth a closer look. It is closely related to cross-session stored prompt injection, but targets the memory-construction stage itself.

How it works

MemPoison runs in two phases. In the injection phase the attacker poisons memory through normal conversational turns on a shared agent. In the triggering phase the backdoor fires through one of two paths: user-triggered, where the trigger sits in external content (a web page) that an innocent user later asks the agent to read, causing the agent to retrieve and act on the planted payload; or attacker-triggered, where the attacker simply issues a query containing the trigger to elicit the malicious response on demand.

The contribution is in getting the payload past selective memory. The paper describes three components, which we summarize conceptually rather than reproduce:

A semantic relational bridge binds the trigger and the payload into a single coherent statement, so the extraction stage keeps them together instead of dropping one half.
Entity masquerading shapes the trigger to look like a named entity, so the rewriting stage preserves it rather than paraphrasing it away.
Joint embedding optimization pulls the trigger-bearing texts into a tight cluster in embedding space while keeping them separated from benign content, so retrieval reliably surfaces the payload on the trigger and stays quiet otherwise.

INJECTION  (attacker chats on a shared agent)
  conversational turns ──▶ selective memory pipeline
                            (extract → summarize → rewrite → embed)
                                         │  survives, by design
                                         ▼
                              LONG-TERM MEMORY  [trigger + payload]

   ... attacker leaves; the entry sits dormant ...

TRIGGERING
  (A) victim asks agent to read external content carrying the trigger
  (B) attacker re-queries with the trigger
        └──▶ retrieval surfaces the payload ──▶ agent emits malicious response

No working payload is reproduced here; the lesson does not need one. The paper reports attack success rates of up to 0.95 with benign accuracy preserved, and the backdoor stays effective against perplexity-based filtering and paraphrasing. The authors’ mechanistic analysis attributes this durability to embedding-space anisotropy and attention redistribution: the attack exploits structural properties of the memory system, not a single brittle string.
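
The clustering property described above also suggests something a defender could look for. The sketch below is not from the paper: it is a hypothetical audit that flags groups of stored entries whose embeddings sit unusually close together. The function names, the 0.95 threshold and the minimum group size of 3 are all assumptions chosen for illustration, and a tight cluster is only a signal for review, since legitimate repeated facts cluster too and an adaptive attacker may stay under any fixed threshold.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tight_groups(entries, threshold=0.95, min_size=3):
    """Return groups of entry ids linked by cosine similarity >= threshold.

    `entries` is a list of (entry_id, embedding) pairs. Grouping is
    single-link: two entries share a group if a chain of sufficiently
    similar pairs connects them. Threshold and min_size are illustrative
    assumptions, not values taken from the paper.
    """
    parent = list(range(len(entries)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(entries)), 2):
        if cosine(entries[i][1], entries[j][1]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, (entry_id, _) in enumerate(entries):
        groups.setdefault(find(i), []).append(entry_id)
    return [sorted(g) for g in groups.values() if len(g) >= min_size]


if __name__ == "__main__":
    # Toy 3-dimensional "embeddings"; real ones come from the agent's embedding model.
    store = [
        ("m1", [1.0, 0.0, 0.0]),
        ("m2", [0.99, 0.05, 0.0]),
        ("m3", [0.98, 0.0, 0.1]),
        ("m4", [0.0, 1.0, 0.0]),
        ("m5", [0.0, 0.0, 1.0]),
    ]
    for group in tight_groups(store):
        print("review:", group)
```

The pairwise comparison is quadratic in the number of entries, so a real deployment would more plausibly run this per tenant or per source, or lean on the vector store's own nearest-neighbour queries.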

Why it matters

The unsettling part is the threat model, not the numbers. The attacker is an ordinary user of a shared agent, with no special access, leaving a trap that a different user trips later. A selective memory pipeline, exactly the mechanism teams add to feel safer about what gets remembered, is treated here as the attack surface rather than the defense. And because the payload is dormant and stealthy by construction, nothing looks wrong in the turn that plants it: the chance to detect it is spread over the days the entry sits waiting for its trigger.

This lands hardest on multi-tenant and team deployments: shared knowledge bases, common memory stores, agents that serve many users from one persistent state. In those settings a single poisoned conversation can surface in someone else’s session, turning a per-user nuisance into a persistence foothold.

Defenses

No single control removes this class. The goal is to stop trusting memory as authoritative context and to break the write-then-retrieve loop.

Treat retrieved memory as untrusted input. Re-validate persisted content on read with the same scrutiny applied to fresh external content. The root error is the agent trusting its own memory because “it’s already in there.”
Partition memory by provenance. Tag entries with their source and the task that wrote them. Content derived from untrusted conversation should be quarantined and never fed straight into planning or tool-selection prompts.
Isolate per user and per tenant. Prefer scoped memory over shared mutable stores. An entry written in one user’s session should not be retrievable in another’s by default — this alone neutralizes the cross-user trigger path.
Defend at retrieval, not only at write. Because MemPoison survives perplexity and paraphrase filters at write time, add retrieval-time checks: provenance gating, anomaly detection on which memories surface for a query, and entailment or consistency checks between a retrieved memory and the current request.
Gate and log every write. Make memory writes explicit and policy-bound — what may be written, by which task, from which source — with provenance logging so a later investigation can trace a behavior back to the conversation that planted it.
Expire aggressively and constrain egress. TTLs and forgetting limit how long a dormant entry can wait for its trigger; least-privilege tooling and egress monitoring shrink what a fired payload can actually do.
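
Several of these controls (provenance tags, tenant scoping, gated and logged writes, TTLs) fit in a small amount of code. The following is a minimal sketch under stated assumptions, not a reference implementation: `ScopedMemory`, `MemoryEntry`, the source labels and the TTL values are all hypothetical names and numbers, and similarity search is omitted so that the policy logic stays visible.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryEntry:
    text: str
    tenant: str         # whose memory this entry belongs to
    source: str         # provenance label, e.g. "conversation" or "operator"
    task: str           # task that performed the write
    created_at: float   # seconds since epoch
    ttl_seconds: float

    def expired(self, now):
        return now - self.created_at >= self.ttl_seconds


# Hypothetical write policy: sources allowed to write, and how long entries live.
WRITE_POLICY = {
    "operator": 30 * 86400.0,
    "conversation": 7 * 86400.0,
}
# Only these sources may feed planning or tool-selection prompts.
TRUSTED_FOR_PLANNING = {"operator"}


class ScopedMemory:
    def __init__(self):
        self._entries = []
        self.audit_log = []

    def write(self, text, tenant, source, task, now=None):
        """Gate the write on policy and log the attempt either way."""
        now = time.time() if now is None else now
        ttl = WRITE_POLICY.get(source)
        allowed = ttl is not None
        self.audit_log.append(
            {"at": now, "tenant": tenant, "source": source, "task": task, "allowed": allowed}
        )
        if allowed:
            self._entries.append(MemoryEntry(text, tenant, source, task, now, ttl))
        return allowed

    def read(self, tenant, for_planning=False, now=None):
        """Return unexpired entries for one tenant; planning reads see trusted sources only."""
        now = time.time() if now is None else now
        self._entries = [e for e in self._entries if not e.expired(now)]
        visible = [e for e in self._entries if e.tenant == tenant]
        if for_planning:
            visible = [e for e in visible if e.source in TRUSTED_FOR_PLANNING]
        return visible


if __name__ == "__main__":
    mem = ScopedMemory()
    mem.write("Prefers metric units", tenant="alice", source="conversation", task="chat")
    mem.write("Escalate outages to on-call", tenant="alice", source="operator", task="setup")
    print([e.text for e in mem.read("alice", for_planning=True)])
    print(mem.read("bob"))  # another tenant sees nothing
```

The design choice worth noting is that the filter runs on read as well as on write: a conversation-derived entry can be stored and recalled for its own tenant, but it never reaches a planning prompt and never crosses a tenant boundary, whatever its text looks like.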

Status

Aspect	Naive memory poisoning	MemPoison (May 2026)
Access needed	Often assumes direct memory write	Conversation only, no privileges
Selective memory pipeline	Assumed bypassed / absent	Engineered to survive extract + rewrite
Trigger	Immediate or implicit	Dormant; user- or attacker-triggered
Reported success rate	Varies	Up to 0.95, benign accuracy preserved
Resists perplexity / paraphrase	Not reliably	Yes
Primary control	Write-time filtering	Provenance-aware retrieval + isolation

The takeaway from the May 2026 paper is a shift in where the defense has to sit. Filtering what goes into memory is not enough when an attacker can shape a payload to pass extraction and rewriting intact. In a stateful agent, the memory pipeline is part of the attack surface — and controls that trust retrieved memory as ground truth will keep missing backdoors that were built to look like memories.