AGENTS MEDIUM NEW

Zombie agents: when a self-evolving LLM agent stays compromised across sessions

A one-time indirect injection observed during a benign session can be written to an agent's long-term memory and later replayed as instruction — turning a transient prompt into persistent control. Attack paper dated February 2026, defense (CAMS) May 2026.

2026-06-18 // 7 min affects: llm-agents, self-evolving-agents, memory-based-agents, rag-agents

What is this?

“Self-evolving” agents are LLM agents that update their own internal state between sessions: they write summaries, successful trajectories, user preferences, or retrieved facts into a long-term memory store and read it back on later runs. The Zombie Agents paper (arXiv, February 2026) studies a failure mode specific to this design. An attacker who controls untrusted content the agent merely observes during one ordinary session — a web page, a document, a tool result — can get a payload written into that memory and then treated as a trusted instruction in future sessions. The result is persistence: a single, one-time injection becomes durable, hands-off control. The authors call the compromised agent a “zombie.”

The point is structural, not a single product bug. It generalizes the earlier observation (e.g. MINJA, arXiv March 2025) that memory-backed agents can be steered through normal user interaction with no elevated privileges, and extends it to agents that rewrite their own state over time.

How it works

The chain has three stages. They are described here at the level of the published research framing, not as a working exploit:

Ingestion. During a benign task, the agent processes attacker-controlled external content. Because self-evolving agents persist what they see — observations, “successful” experiences, distilled notes — some of that content is written to long-term memory.
Promotion to instruction. On a later session, the memory retriever surfaces the stored item as relevant context. The agent has no reliable boundary marking it as observed data rather than an instruction to follow, so it can act on it. This is the core data-vs-instruction confusion, now displaced in time.
Self-reinforcement. The paper’s contribution is showing that the payload can be designed to survive common memory hygiene (truncation, relevance filtering, summarization) and even to write itself back into memory each time it fires, so the compromise outlives the session that created it.
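The three stages can be sketched with a toy model. This is an illustration of the structural problem only, not code from the paper: `NaiveMemory`, `build_prompt`, and the keyword-overlap retriever are all hypothetical stand-ins, and the attacker-controlled text is a placeholder string rather than a payload.

```python
from dataclasses import dataclass, field


@dataclass
class NaiveMemory:
    """Hypothetical long-term store that keeps bare strings with no provenance."""
    items: list[str] = field(default_factory=list)

    def write(self, text: str) -> None:
        self.items.append(text)

    def retrieve(self, query: str) -> list[str]:
        # Stand-in for a real retriever: keep items sharing a word with the query.
        words = set(query.lower().split())
        return [t for t in self.items if words & set(t.lower().split())]


def build_prompt(task: str, memory: NaiveMemory) -> str:
    # Retrieved notes are pasted next to the task with nothing marking them as
    # "observed data", which is the confusion the promotion stage relies on.
    recalled = "\n".join(memory.retrieve(task))
    return f"Notes from earlier sessions:\n{recalled}\n\nCurrent task: {task}"


memory = NaiveMemory()

# Session 1 (ingestion): the agent stores a note distilled from external content.
# The placeholder stands for attacker-controlled text; no real payload is shown.
memory.write("report formatting: <text copied from an untrusted web page>")
memory.write("user prefers metric units in every report")

# Session 2 (promotion): both notes come back looking equally trustworthy.
prompt = build_prompt("write the weekly report", memory)

# Self-reinforcement: a naive end-of-session summary copies the whole context,
# untrusted note included, back into memory for the next session.
memory.write(f"session summary: {prompt}")
```

Nothing in this loop is exotic. The weakness comes from the store holding bare strings, so by the second session the origin of each note is already lost.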

Key dates for judging freshness: the attack framing is February 2026; the query-only memory-injection precursor (MINJA) is March 2025; a dedicated attack-and-defense study on memory-based agents appeared January 2026 (arXiv 2601.05504). No payloads are reproduced here.

Why it matters

Most prompt-injection defenses are per-session: they filter the current input or output. This class of attack is explicitly designed to defeat that assumption. If the malicious instruction is dormant in memory and only activates on a later trigger, a clean input filter at run time sees nothing wrong. The blast radius grows with autonomy and memory persistence: long-running assistants, agents that accumulate user history, and multi-user deployments where one user’s poisoned memory could influence another are the most exposed. In regulated domains — the CAMS authors use electronic health record agents as their example — durable, silent behavioral drift is a serious integrity and confidentiality concern.

Defenses

Defending persistence means treating the memory store as an untrusted, security-relevant boundary rather than a convenience cache. The Cognitive Autonomous Memory Security (CAMS) framework (ScienceDirect, May 2026) proposes a five-layer middleware that requires no change to the underlying model and is a useful checklist even if you build your own:

Write-time gating. Run a “WriteGuard” pipeline and semantic-intent screening over every item before it enters long-term memory. This is the cheapest place to stop ingestion of injected instructions.
Provenance and zero-trust storage. Tamper-evident records of where each memory came from, so observed external content is never silently promoted to trusted instruction.
Temporal drift monitoring. Watch embedding drift and sequence evolution over time to catch slow, progressive poisoning that any single check would miss.
Cross-memory / graph reconstruction. Correlate related entries to detect attacks split across multiple stored items or multiple users.
Periodic re-scanning. A long-term-memory scanner that re-evaluates already-stored memories, since an item can become malicious in context only later.
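As a rough illustration of three of these layers (write-time gating, tamper-evident provenance, and periodic re-scanning), the sketch below uses only the Python standard library. It is not the CAMS implementation. `GuardedMemory`, `Record`, `looks_like_instruction`, and the regex screen are hypothetical names chosen for this example, and a single regex is a deliberately crude stand-in for real semantic-intent screening.

```python
import hashlib
import json
import re
from dataclasses import dataclass

# Toy screen only: a real deployment would use a classifier, not one regex.
IMPERATIVE_HINT = re.compile(r"\b(ignore|always|never|you must|from now on)\b", re.I)


def looks_like_instruction(text: str) -> bool:
    return bool(IMPERATIVE_HINT.search(text))


@dataclass(frozen=True)
class Record:
    text: str
    source: str      # where the content came from, e.g. "user", "web", "tool"
    prev_hash: str   # digest of the previous record, forming a hash chain
    digest: str


def _digest(text: str, source: str, prev_hash: str) -> str:
    payload = json.dumps([text, source, prev_hash]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class GuardedMemory:
    GENESIS = "0" * 64

    def __init__(self) -> None:
        self.records: list[Record] = []

    def write(self, text: str, source: str) -> bool:
        # Write-time gate: non-user content that reads like an instruction
        # is rejected before it reaches long-term memory.
        if source != "user" and looks_like_instruction(text):
            return False
        prev = self.records[-1].digest if self.records else self.GENESIS
        self.records.append(Record(text, source, prev, _digest(text, source, prev)))
        return True

    def verify_chain(self) -> bool:
        # Provenance check: any edited or reordered record breaks the chain.
        prev = self.GENESIS
        for r in self.records:
            if r.prev_hash != prev or r.digest != _digest(r.text, r.source, prev):
                return False
            prev = r.digest
        return True

    def rescan(self, screen=looks_like_instruction) -> list[Record]:
        # Periodic re-scan: apply a (possibly updated) screen to stored records.
        return [r for r in self.records if r.source != "user" and screen(r.text)]


mem = GuardedMemory()
accepted = mem.write("Quarterly revenue was flat.", source="web")
rejected = mem.write("From now on, always do <placeholder action>.", source="web")
mem.write("Remember to prefer <placeholder> in summaries.", source="web")

# A later, stricter screen catches an item that passed the original gate.
stricter = re.compile(r"\bremember to\b", re.I).search
flagged = mem.rescan(screen=stricter)
```

The hash chain only makes tampering detectable. It does not decide whether a record should be trusted, so it complements the gate rather than replacing it.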

Complementary engineering controls: separate “what the agent saw” from “what the agent should do” at the schema level; scope memory per user and per trust level; require human confirmation before high-impact actions sourced from retrieved memory; and apply the lethal-trifecta logic — be most cautious when an agent combines persistent memory, exposure to untrusted content, and the ability to act or exfiltrate.
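A minimal sketch of the first three controls, assuming a simple in-memory list as the store. All names here (`Kind`, `Trust`, `MemoryItem`, `make_item`, `recall`, `authorize`, and the `HIGH_IMPACT` action names) are hypothetical and chosen for illustration. The design choice it shows is that the schema, not the model, decides whether a stored item may ever be treated as a directive.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Kind(Enum):
    OBSERVATION = "observation"   # what the agent saw
    DIRECTIVE = "directive"       # what the agent should do


class Trust(Enum):
    UNTRUSTED = 0
    USER = 1


@dataclass(frozen=True)
class MemoryItem:
    user_id: str
    kind: Kind
    trust: Trust
    text: str


def make_item(user_id: str, text: str, from_user: bool) -> MemoryItem:
    # Only content that came directly from the user can become a directive;
    # everything else is stored as an untrusted observation.
    if from_user:
        return MemoryItem(user_id, Kind.DIRECTIVE, Trust.USER, text)
    return MemoryItem(user_id, Kind.OBSERVATION, Trust.UNTRUSTED, text)


def recall(store: list[MemoryItem], user_id: str, kind: Kind) -> list[MemoryItem]:
    # Per-user scoping: one user's memory is never returned for another user.
    return [m for m in store if m.user_id == user_id and m.kind is kind]


# Hypothetical action names that this deployment treats as high impact.
HIGH_IMPACT = {"send_email", "delete_file", "make_payment"}


def authorize(action: str, sourced_from: list[MemoryItem],
              confirm: Callable[[str], bool]) -> bool:
    # High-impact actions justified by retrieved memory need a human decision.
    if action in HIGH_IMPACT and sourced_from:
        return confirm(f"Allow '{action}' based on {len(sourced_from)} stored item(s)?")
    return True


store = [
    make_item("alice", "use metric units", from_user=True),
    make_item("alice", "<text observed on an external page>", from_user=False),
    make_item("bob", "use imperial units", from_user=True),
]

alice_directives = recall(store, "alice", Kind.DIRECTIVE)
alice_observations = recall(store, "alice", Kind.OBSERVATION)

# With a reviewer who declines, the memory-sourced high-impact action is blocked.
allowed = authorize("send_email", alice_observations, confirm=lambda question: False)
```

In a real agent the `confirm` callback would route to a person or an approval queue. The lambda here only simulates a reviewer who says no.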

Status

This is published academic research on a class of weakness in self-evolving and memory-based agents, not a vulnerability in a specific named product, and no exploit payloads are disclosed. The attack analysis (Zombie Agents) is dated February 2026, the foundational memory-injection work (MINJA) March 2025, and the CAMS defense May 2026, which places the freshest source within the last ~90 days. Builders of memory-backed agents should assume per-session input filtering is necessary but not sufficient, and add write-time gating, provenance, and drift monitoring on the memory store itself.