DEFENSE MEDIUM

Agents Rule of Two: Meta's pragmatic answer to unsolved prompt injection

Published Oct 31, 2025 by Meta and adopted in Databricks' May 2026 guide, the Agents Rule of Two limits any agent session to two of three risky properties. It is the most actionable framework available while prompt injection remains unsolved.

2026-05-25 // 6 min // affects: llm-agents, tool-use, rag-pipelines, mcp-clients, ai-coding-assistants

What is the Agents Rule of Two?

Meta’s AI Security team published the Agents Rule of Two on October 31, 2025, and the framework has since been picked up by Databricks (May 2026 prompt-injection guide), Oso, Xano and most agent-platform vendors. It is the most widely cited operational response to prompt injection — a problem the industry now openly admits has no reliable model-side fix.

The rule is one sentence: within a single agent session, satisfy at most two of these three properties.

  • (A) The agent processes untrustworthy inputs (web pages, emails, tool outputs, RAG documents, user-generated content).
  • (B) The agent has access to sensitive systems or private data (corporate secrets, customer PII, source code, credentials).
  • (C) The agent can change state or communicate externally (write to a database, send an email, call a paid API, push to git, post on Slack).

Pick any two. Never wire all three into the same loop without a human in the path.
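
As a minimal sketch of the rule itself, the three properties can be modelled as bit flags and checked before a session is wired up. The names `Risk`, `count_properties` and `needs_human` are illustrative assumptions, not part of Meta's post.

```python
from enum import Flag, auto


class Risk(Flag):
    """The three properties from the rule (illustrative names)."""
    UNTRUSTED_INPUT = auto()   # (A) processes untrustworthy inputs
    SENSITIVE_ACCESS = auto()  # (B) reaches sensitive systems or private data
    STATE_CHANGE = auto()      # (C) changes state or communicates externally


ALL_THREE = Risk.UNTRUSTED_INPUT | Risk.SENSITIVE_ACCESS | Risk.STATE_CHANGE


def count_properties(session: Risk) -> int:
    """Number of the three properties present in a session."""
    members = (Risk.UNTRUSTED_INPUT, Risk.SENSITIVE_ACCESS, Risk.STATE_CHANGE)
    return sum(1 for member in members if member in session)


def needs_human(session: Risk) -> bool:
    """True when a session holds all three properties at once."""
    return (session & ALL_THREE) == ALL_THREE


research_assistant = Risk.UNTRUSTED_INPUT | Risk.SENSITIVE_ACCESS
print(count_properties(research_assistant), needs_human(research_assistant))
print(count_properties(ALL_THREE), needs_human(ALL_THREE))
```

A check like this is only as good as the classification behind it: the hard part is deciding honestly which flags each tool sets.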

How it works

The Rule of Two is a structural rewrite of Simon Willison’s lethal trifecta (coined June 2025) and is openly inspired by Chromium’s Rule of Two for sandbox boundaries. Both share the same insight: when a known-vulnerable surface (a parser, an LLM) sits between untrusted data and sensitive capabilities, you do not “fix” the parser; you remove one leg of the triangle.

Translated into design patterns:

  • A + B, no C — a read-only research assistant that summarises customer tickets with no ability to send or write.
  • A + C, no B — a public chatbot that can post a reply but only sees the current message, with no access to internal data.
  • B + C, no A — an automation that touches private data and writes back, but only consumes structured fields produced by trusted code (no free-text from the outside).
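
The three patterns above can be checked at design time by tagging each tool and taking the union of tags per session. This is a sketch under stated assumptions: the `Tool` class, the tool names and the A/B/C tags assigned to them are hypothetical, and a real classification depends on what each tool actually returns and reaches.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Tool:
    """A tool plus the Rule of Two properties it brings into a session."""
    name: str
    untrusted_input: bool = False   # (A) returns attacker-influenced text
    sensitive_access: bool = False  # (B) reads private data or secrets
    state_change: bool = False      # (C) writes or communicates externally


def session_properties(tools: Iterable[Tool]) -> set:
    """Union of the properties of every tool wired into one session."""
    props = set()
    for tool in tools:
        if tool.untrusted_input:
            props.add("A")
        if tool.sensitive_access:
            props.add("B")
        if tool.state_change:
            props.add("C")
    return props


def describe(tools: Iterable[Tool]) -> str:
    """Name the pattern a tool set falls into."""
    props = session_properties(tools)
    if props == {"A", "B", "C"}:
        return "A + B + C: needs a human in the path"
    return " + ".join(sorted(props)) or "none"


# Hypothetical tool sets mirroring the three patterns.
research = [Tool("read_tickets", untrusted_input=True, sensitive_access=True)]
chatbot = [Tool("read_message", untrusted_input=True),
           Tool("post_reply", state_change=True)]
automation = [Tool("read_crm_fields", sensitive_access=True),
              Tool("update_crm", sensitive_access=True, state_change=True)]

print(describe(research))    # A + B
print(describe(chatbot))     # A + C
print(describe(automation))  # B + C
# Adding one outbound tool to the research assistant completes the trifecta.
print(describe(research + [Tool("send_email", state_change=True)]))
```

The last line shows how easily a compliant design drifts: a single added tool moves the research assistant from A + B to all three.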

Meta’s blog post (Oct 2025) makes the failure mode explicit: when all three are present, a single indirect prompt injection in an untrusted document can turn the agent into a confused deputy, one that performs actions it is authorised to take but on the attacker’s behalf. The same pattern is what the May 2026 “Comment and Control” disclosure exploited in Claude Code, Gemini CLI and GitHub Copilot Agent, and what 2026 CVEs against PraisonAI (CVE-2026-44338), Semantic Kernel (CVE-2026-25592, CVE-2026-26030) and LMDeploy (CVE-2026-33626) ultimately enable.

Why it matters

The October 2025 “Attacker Moves Second” paper (Nasr, Carlini et al., arXiv:2510.09023) tested twelve published prompt-injection defenses against adaptive attackers. Eleven were bypassed with attack success rates above 90%. The same conclusion appears in “Output filtering beats model self-defense” (Swept AI / Michigan, May 2026): on 20,000 adaptive attacks, every model-side defense eventually broke.

If model-side defenses cannot be trusted, deployment becomes an architectural problem rather than a prompting one. The Rule of Two answers that bluntly: stop trying to make the LLM safe; constrain what the LLM is allowed to reach.

Databricks’ May 2026 guide operationalises this with nine layered controls on Unity Catalog and Agent Bricks, including PII redaction at the AI Gateway, Llama Protection models on input and output, egress allow-lists, capability tokens that bind a tool to a single user session, and human-in-the-loop approval when all three properties are unavoidable.
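
To make the capability-token idea concrete, here is a minimal sketch of a short-lived token bound to one user, one session and one tool, signed with HMAC from the Python standard library. It is not Databricks' implementation: `issue_token`, `check_token`, the claim names and the hard-coded `SECRET` are assumptions for illustration only.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-only-secret"  # assumption: a real deployment loads this from a secret store


def issue_token(user, session, tool, ttl_s=60, now=None):
    """Sign a token that is valid for one user, one session, one tool."""
    issued = time.time() if now is None else now
    claims = {"user": user, "session": session, "tool": tool, "exp": issued + ttl_s}
    body = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    sig = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    return body.decode() + "." + sig


def check_token(token, user, session, tool, now=None):
    """Accept only an untampered, unexpired token for this exact binding."""
    try:
        body, sig = token.rsplit(".", 1)
    except ValueError:
        return False
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison; the body is decoded only after the signature checks out.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return False
    claims = json.loads(base64.urlsafe_b64decode(body))
    current = time.time() if now is None else now
    bound = (claims["user"], claims["session"], claims["tool"]) == (user, session, tool)
    return bound and current < claims["exp"]


token = issue_token("alice", "s1", "send_email", ttl_s=60, now=1000.0)
print(check_token(token, "alice", "s1", "send_email", now=1010.0))   # True
print(check_token(token, "alice", "s1", "delete_repo", now=1010.0))  # False: wrong tool
print(check_token(token, "alice", "s1", "send_email", now=1061.0))   # False: expired
```

The point of the binding is that a hijacked agent holding this token can still only call `send_email`, as `alice`, in session `s1`, for about a minute.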

Defenses

Concrete steps before shipping an agent:

  • Classify each tool by the three properties (A, B, C). This is spreadsheet-level work, and most teams discover their agent already violates the rule.
  • Split sessions, not just prompts. Tool-use that needs C must run in a separate process with no access to B’s secrets.
  • Quarantine untrusted content. Treat web pages and emails as data; never feed them back to the model as instructions. ASCII smuggling (hidden Unicode tag characters) and indirect injection both rely on the agent treating tool output as authoritative.
  • Bind capabilities to user identity with short-lived tokens, not to the agent’s own session. A hijacked agent should not be able to act outside its caller’s scope.
  • Require human-in-the-loop when A + B + C is genuinely required (incident response, code-review-and-merge agents, ops runbooks).
  • Log the trifecta. Emit a structured event every time an agent session crosses two thresholds; alert on three.
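
The last two steps, the human gate and trifecta logging, can be sketched as a small runtime monitor that is consulted before every tool call. Everything here is a hypothetical illustration: `SessionMonitor`, the `approve` callback and the event fields are assumed names, and the A/B/C tags per tool are expected to come from the classification step.

```python
import json
import logging
from typing import Callable, Iterable

log = logging.getLogger("rule_of_two")


class SessionMonitor:
    """Tracks which of A, B, C a live session has touched so far."""

    def __init__(self, session_id: str, approve: Callable[[str], bool]):
        self.session_id = session_id
        self.approve = approve  # hypothetical hook that asks a human reviewer
        self.seen = set()       # properties touched so far
        self.events = []        # kept in memory here; a real system would ship them

    def _emit(self, level: str, tool: str, props: set) -> None:
        event = {"session": self.session_id, "tool": tool,
                 "properties": sorted(props), "level": level}
        self.events.append(event)
        log.warning(json.dumps(event))  # one structured JSON line per event

    def before_call(self, tool: str, properties: Iterable[str]) -> bool:
        """Return True if the tool call may proceed."""
        requested = set(properties)
        if requested - {"A", "B", "C"}:
            raise ValueError("properties must be drawn from A, B, C")
        after = self.seen | requested
        if len(after) == 3:
            # All three legs would be present: require a human decision.
            if not self.approve(tool):
                self._emit("blocked", tool, after)
                return False
            self._emit("three_properties_approved", tool, after)
        elif len(after) == 2:
            self._emit("two_properties", tool, after)
        self.seen = after
        return True


monitor = SessionMonitor("s1", approve=lambda tool: False)  # reviewer says no
print(monitor.before_call("fetch_url", {"A"}))   # True
print(monitor.before_call("read_crm", {"B"}))    # True, logs two_properties
print(monitor.before_call("send_email", {"C"}))  # False, logs blocked
```

A blocked call leaves `seen` unchanged, so the session can continue with tools that stay inside two properties.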

Critics (notably Ken Huang’s Nov 2025 “Rule of Two vs. Reality”) point out that the framework does not cover memory poisoning, multi-agent collusion or training-time attacks. That is correct: the Rule of Two is a runtime architectural control, not a complete threat model. Combine it with MITRE ATLAS coverage and the OWASP LLM Top 10 for the rest.

Status

Item                                 Date           Status
Meta blog post                       Oct 31, 2025   Public
Simon Willison endorsement           Nov 2, 2025    Public
Michael Bargury analysis             Nov 1, 2025    Public
Databricks operational guide         May 2026       Public
Critical review (Ken Huang)          Nov 2025       Public
Adoption in OWASP LLM Top 10 2026    Pending        Discussion stage

The Rule of Two is not a patch. It is a deployment posture, and as of May 2026 it is the closest thing the industry has to consensus on how to ship LLM agents without waiting for a model-side breakthrough that may never come.

Sources