system: OPERATIONAL
← back to all hacks
AGENTS MEDIUM NEW

Sleeper Memory Poisoning: dormant attacks on stateful LLM agents

A May 2026 paper shows attackers can plant fabricated 'memories' through a document or webpage that lie dormant, then steer an assistant's actions across many later sessions.

2026-06-21 // 6 min affects: gpt-5.5, kimi-k2.6, memory-augmented-agents, stateful-llm-assistants

What is this?

On 14 May 2026 (revised 18 May), researchers from CISPA and collaborators — Sidharth Pulipaka, Stanislau Hlebik, Leonidas Raghav, Sahar Abdelnabi, Vyas Raina, Ivaxi Sheth and Mario Fritz — published Hidden in Memory: Sleeper Memory Poisoning in LLM Agents. It studies a security risk introduced by a feature that has become standard in assistants over the past year: persistent memory, where a model stores user-specific facts across sessions for personalization and continuity.

The paper’s contribution is to characterize a delayed attack the authors call sleeper memory poisoning. An adversary manipulates external context the assistant later reads — a document, a webpage, a code repository — to make the assistant store a fabricated memory about the user. Unlike a conventional prompt injection, which acts in the moment and disappears, the planted memory can stay dormant and re-emerge across many later conversations, long after the malicious content is gone from the context window.

How it works

The attack is best understood as a pipeline with three stages, each of which the paper evaluates independently: write, retrieve, and act.

  1. Write. The victim’s assistant ingests attacker-controlled content during a normal task (summarizing a page, reviewing a repo, reading a shared file). Buried in that content is text engineered not to trigger an immediate action, but to be remembered — phrased as a durable fact about the user or their preferences. The memory subsystem commits it to long-term store.
  2. Retrieve. In a later, unrelated conversation, the assistant’s retrieval step surfaces the poisoned entry as if it were a trusted, user-confirmed fact. The original injection is no longer present, so input-side filters have nothing to inspect.
  3. Act. The fabricated memory steers the model’s behavior — biasing answers, or in agentic setups, driving tool calls that match the attacker’s intent.

The measured rates are striking. Across stateful assistants, poisoned memories were successfully written up to 99.8% of the time on GPT-5.5 and 95% on Kimi-K2.6. Among cases where a poisoned memory was later retrieved, it produced attacker-intended agentic actions in 60–89% of evaluations across the tested models. The numbers are high because the write step exploits exactly the behavior memory features are designed for: eagerly capturing anything that looks like a useful, lasting fact.

A companion June 2026 study, From Untrusted Input to Trusted Memory by Dash, Ge, Jain, Shah and Shang, makes the structural picture explicit. It identifies four memory write channels and nine structural weaknesses across model behavior, system-prompt design, and agent architecture, organizes attacks into six classes, and ships MPBench to measure them (we covered it in MPBench: a shared map for memory poisoning). Its headline finding lines up with the sleeper result: agents tuned to write and retrieve memory more aggressively are more exploitable, and existing prompt-injection defenses do not cover memory poisoning.

Why it matters

Memory turns a one-shot injection into a persistent foothold. The defining property of the sleeper variant is that the write and the payoff are separated in time, which breaks the mental model most defenses are built on. A team can scan every incoming prompt, find nothing, and still have a compromised assistant — because the malicious instruction was committed weeks earlier and now lives inside what the system treats as trusted user state.

This is the lethal trifecta extended along the time axis: access to private data, exposure to untrusted content, and an ability to act, now joined by durability. It generalizes the dormancy idea seen in Trojan Hippo and temporal memory contamination, and it is demonstrated on current commercial assistants, not toy setups. Any deployment where an agent both reads third-party content and keeps long-term memory — personal assistants, coding agents with project memory, support bots that remember accounts — inherits this surface.

Defenses

The papers are diagnostic, but they point clearly at mitigations. Treat memory as an untrusted input boundary, never a trusted cache.

  1. Gate the write path, not just the read path. Input-side prompt-injection filters do not generalize to memory. Add a distinct check at the moment content is committed to durable store, and again at retrieval.
  2. Attach provenance and trust level to every stored item. Tag each memory with its source (user-confirmed, tool output, model reflection) and never let a document- or tool-sourced note be retrieved with the authority of a user-verified fact.
  3. Make memory writes least-aggressive by default. Both papers tie eager write/retrieve policies to higher exploitability. Require a relevance or confirmation threshold before persisting, and prefer ephemeral context when in doubt.
  4. Add a confirmation gate for high-impact memories. Anything that could later change tool authorizations, spending, or credential handling should not be self-writable without a human or policy check.
  5. Version and audit memory. Because write and trigger are time-separated, keep a trail of who or what wrote each entry and when, so a poisoned note can be traced after it fires — see agent audit-trail integrity and OWASP’s agent memory guard.
  6. Benchmark your own agent. Use MPBench (or its methodology) to enumerate which write channels your deployment actually exposes, rather than assuming one filter covers them.

Status

ItemReferenceDateNotes
Hidden in Memory: Sleeper Memory PoisoningarXiv 2605.153382026-05-14 (rev. 05-18)Defines delayed/dormant memory poisoning; full write→retrieve→act pipeline
Write success ratesPaper abstract2026-05-14Up to 99.8% (GPT-5.5), 95% (Kimi-K2.6)
Agentic-action rate among retrievalsPaper abstract2026-05-1460–89% across tested models
From Untrusted Input to Trusted Memory (MPBench)arXiv 2606.043292026-06-034 write channels, 9 weaknesses, 6 attack classes; PI defenses don’t cover memory

The takeaway is not that memory poisoning is brand new — it is that the sleeper framing shows how cleanly the attack hides in time, and the measured rates on real assistants show it is not hypothetical. If your agent has persistent memory and your only defense is an input-side prompt filter, assume you are not covered.

This article summarizes publicly available research for defensive and educational purposes. It reproduces no exploit code.

Sources