system: OPERATIONAL
← back to all hacks
AGENTS MEDIUM NEW

ClawTrojan: stored prompt injection becomes a persistent agent backdoor

A May 29, 2026 arXiv paper shows injection hidden in a file can be stored by a local agent and run later — reaching 95.5% attack success where single-turn injection scores near zero.

2026-06-03 // 6 min affects: gpt-5.4, llm-agents, coding-agents, openclaw

What is this?

On May 29, 2026, Jiejun Tan, Zhicheng Dou, Xinyu Yang, Yuyang Hu, Yiruo Cheng, Xiaoxi Li and Ji-Rong Wen posted From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors (arXiv:2605.31042, cs.CR/cs.AI/cs.CL). The code and benchmark are released under the RUC-NLPIR GitHub organisation.

The paper’s contribution is not a new payload. It is a measurement of a blind spot: in a local agentic harness — where an LLM reads and writes files, calls tools, and reuses workspace state across sessions — an injected instruction does not have to act immediately. It can be stored and executed much later. The authors build a benchmark, ClawTrojan, to surface these multi-step trojans, and a defence, DASGuard, that traces where control-like text in a workspace actually came from. On an OpenClaw-style simulated workspace running GPT-5.4, ClawTrojan reaches a 95.5% attack success rate (ASR), while conventional single-turn prompt-injection attacks score near zero on the same model.

How it works

The attack splits a single malicious goal across several innocuous-looking steps. An attacker plants a prompt injection inside content the agent will encounter — a file, a tool output, a piece of retrieved context. The agent reads that hidden instruction and, crucially, writes it into the workspace: a config file, a notes file, a piece of persisted state. Later — possibly in a different session — the agent reads its own workspace back, treats the planted text as legitimate control content, and acts on it.

Step 1  Agent reads untrusted file/tool output containing hidden text
Step 2  Agent writes that text into a sensitive local file  [backdoor planted]
   ...  (looks like ordinary workspace maintenance)
Step N  Agent reads the file back and executes the planted instruction

The key insight is structural: no individual step looks malicious. A write to a config file is routine. Reading workspace state is routine. The harm only exists in the sequence. The paper notes that existing defenses inspect each step in isolation, so they may successfully block a clearly harmful final action, yet completely miss the earlier write operation that planted the backdoor in the first place. That gap is why single-turn injection scores near zero on GPT-5.4 while the multi-step variant succeeds 95.5% of the time — the model’s per-step refusal behaviour is intact; the workflow is what is being exploited.

This is the same lesson as temporal memory contamination and dormant memory exfiltration, now applied to the file-backed state of a local coding harness rather than a vector memory store.

Why it matters

Local agentic harnesses are exactly the environments where this matters most. They hold persistent state by design — that is what makes them useful across sessions — and they grant the agent file-write and tool-execution capability. That combination turns a stored string into latent control. It is the lethal trifecta (untrusted content, sensitive capability, persistence) expressed through the filesystem.

Two consequences follow. First, per-action guardrails create a false sense of safety: a harness can refuse every individually dangerous request and still be fully compromised, because the dangerous instruction was laundered through a benign write earlier in its history. Second, the blast radius is durable. A backdoor written to workspace state survives session boundaries, so a single exposure to a poisoned file can grant control over future runs the operator never connects to the original injection.

These are simulated results on a benchmark, not an observed in-the-wild campaign, and the harness is OpenClaw-style rather than a specific shipping product. The right reading is a lower bound on a class of weakness, not a single-vendor bug.

Defenses

The paper’s own defence, DASGuard, is a provenance-tracing approach, and its design generalises into concrete guidance.

  1. Track provenance of control-like text. DASGuard scans sensitive local files for content that reads like instructions or control flow, traces its origin, and removes control content that did not originate from a trusted source. The principle: a string’s authority should depend on where it came from, not on where it now sits. This echoes the data-flow-is-not-authority and provenance-graph lines of work.

  2. Defend the sequence, not the step. Because no single action is malicious, isolated per-action filtering is insufficient. Correlate write operations with later reads of the same artifact, and treat “agent writes instruction-shaped text into its own state, then later acts on it” as a first-class detection target.

  3. Combine runtime blocking with sanitised commits. DASGuard pairs runtime attack blocking with sanitised commits to the workspace — it does not only stop the live action, it cleans the planted backdoor out of persisted state so it cannot fire later. Treat workspace state as untrusted-until-validated on both write and read.

  4. Constrain the harness. Least-privilege file access, segregation of agent-writable state from agent-readable config, and egress control bound what any planted instruction can achieve, regardless of how it was laundered. This is the architectural posture of agents rule-of-two.

Status

ItemReferenceDateNotes
Paper publishedarXiv:2605.31042 [cs.CR]2026-05-29”From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors”
BenchmarkClawTrojanMulti-step trojan detection in local agentic harnesses
DefenceDASGuardProvenance scan of control-like text + sanitised commits
EvaluationOpenClaw-style workspace, GPT-5.495.5% ASR multi-step vs. near-zero single-turn
Code & datagithub.com/RUC-NLPIR/ClawTrojanReleased with the paper
Exploitation statusNone observed in the wildSimulated benchmark; no live-system payloads released

The takeaway is not “agents can be prompt-injected” — that is old news. It is that persistence turns a refused instruction into a stored one, and defence has to read the history of an agent’s own workspace, not just its next action.

Sources