system: OPERATIONAL
← back to all hacks
DEFENSE LOW NEW

Architecting secure agents: a plan-and-policy defense against prompt injection

An NVIDIA position paper (March 31, 2026) argues that indirect prompt injection cannot be fixed at the model alone — and proposes a plan-and-policy system architecture that constrains what an agent may observe and decide.

2026-06-16 // 6 min affects: llm-agents, tool-using-agents, mcp-clients, rag-systems

What is this?

On March 31, 2026, researchers from NVIDIA and collaborators published Architecting Secure AI Agents: Perspectives on System-Level Defenses Against Indirect Prompt Injection Attacks (arXiv:2603.30016). It is a position paper, not a new attack: its starting point is that indirect prompt injection — malicious instructions hidden in retrieved emails, web pages, or tool outputs, first formalized by Greshake et al. in Not what you’ve signed up for (2023) — is unlikely to be solved at the model level alone. The authors instead ask how the system around the model should be structured so that a single injected string cannot escalate into a dangerous action.

Their answer is an architecture built on two concepts — plan and policy — and three design positions about where security decisions should live. One of the co-authors, Kai Greshake, was on the original indirect-injection paper, which makes the framing notable: the people who named the problem now argue the fix is architectural.

How it works

The paper introduces a vocabulary for analyzing agents. A plan describes what an agent intends to do: an ordered sequence (or graph) of execution steps, each step being a concrete action with its inputs and outputs — for example, GET_RECENT_EMAIL(sender=Alice) -> emails; SUMMARIZE(emails) -> summary; DRAFT_REPLY(summary) -> draft. A policy describes what the agent is allowed to do: a predicate over steps and execution history that marks each action as permitted or not, inducing the subset of plans the agent may legally execute. Policies can range from global static access-control rules (“never read data the user cannot access”) to context-dependent information-flow constraints.

The reference architecture wires these together as distinct modules rather than one monolithic model:

  1. An Orchestrator (an LLM) turns a high-level task into an initial plan and policy.
  2. A Plan/Policy Approver reviews that plan and policy, gives feedback, and can escalate to a human for ambiguous objectives.
  3. An Executor (an LLM) turns the approved plan into a concrete action, such as a tool call with arguments.
  4. A Policy Enforcer approves or blocks each proposed action — using rule-based checks, an LLM judge, or human confirmation for high-risk steps — before it ever reaches the environment.
  5. The Environment (APIs, the web, the file system) runs only approved actions and returns responses, which may trigger plan or policy updates.

Crucially, environmental feedback passes through checkpoints (marked as “shields” in the paper) where the system can pass raw text, transform or filter it into a safer representation, or monitor for anomalies — so untrusted tool output never silently becomes a new instruction.

Why it matters

Most deployed agents collapse all of these roles into a single model that plans, decides what is allowed, and acts in one undifferentiated token stream — exactly the condition that makes indirect injection effective, because the model cannot reliably separate trusted commands from untrusted data. By making the plan and policy explicit and enforcing them in separate components, the architecture shrinks the attack surface: an injected instruction in a retrieved email might corrupt the Executor’s proposed action, but it still has to pass a Policy Enforcer that was configured independently of that content. The authors also warn that current benchmarks can create a “false sense of utility and security,” because they often test models in isolation rather than the end-to-end system that would actually defend a production agent.

Defenses

The paper’s contribution is a defensive blueprint, organized as three positions for practitioners building agentic systems:

  • Dynamic, security-aware replanning. Static, one-shot plans break in realistic environments. The system should be able to update both plan and policy as context evolves — but treat each update as a security-relevant event, not a free-form rewrite.
  • Use LLMs only where you must, and constrain them. Programmatic, rule-based checks should handle anything that can be formalized (access control, allow-lists). Reserve LLM judgment for genuinely hard, context-dependent decisions — and when an LLM does make a security call, tightly limit what it can observe and what it is allowed to decide. A constrained input and a narrow decision scope make the model far harder to manipulate and make robustness research tractable.
  • Treat human interaction as a core design element. Ambiguous cases are unavoidable, so human oversight cannot be bolted on; the open challenge is reducing how often a human must intervene without sacrificing security or utility.

These positions align with the broader 2026 defensive consensus — including Design Patterns for Securing LLM Agents against Prompt Injections and Meta’s “Agents Rule of Two” — that least privilege, isolation of untrusted content, and deterministic egress control belong in the system architecture, not solely in the model’s weights.

Status

This is a peer-community position paper (arXiv:2603.30016, posted March 31, 2026), not a vulnerability disclosure, so there is no patch or CVE. The authors describe the architecture as a “skeleton” for future agentic systems and call for benchmarks that evaluate whole systems rather than isolated models. The practical takeaway for teams shipping agents today: separate planning, policy, and enforcement; keep policy checks programmatic wherever possible; and constrain any model used for a security decision to the narrowest possible input and authority.

Sources