DEFENSE MEDIUM NEW

Microsoft's agentic failure-mode taxonomy v2.0: zero-click human-in-the-loop bypass

Microsoft's AI Red Team v2.0 taxonomy (June 4, 2026) adds seven agentic failure modes and reports human-in-the-loop bypass as the most consistently exploited — including zero-click chains from a single external input.

2026-06-07 // 7 min affects: llm-agents, mcp-clients, computer-use-agents, multi-agent-systems, ai-coding-assistants

What is this?

On June 4, 2026, the Microsoft AI Red Team (AIRT) published a v2.0 update to its Taxonomy of Failure Modes in Agentic AI Systems. The original v1.0 (April 2025) was largely forward-looking, built from threat modeling and practitioner interviews. The v2.0 whitepaper is different: it is grounded in twelve months of red-team engagements against deployed agentic systems, adds seven new failure-mode categories and five new mitigation families, and cross-references OWASP, CSA, MITRE, NIST and CoSAI.

The most operationally significant finding is empirical, not conceptual: across a year of engagements, human-in-the-loop (HitL) bypass was the most consistently exploited failure mode, and several engagements produced zero-click, end-to-end chains that started from a single external input and reached data exfiltration or lateral movement with no human interaction beyond the initial agent launch.

How it works

AIRT’s framing is that real attacks rarely live in a single failure mode — compound chains are the norm. The seven new categories describe the surfaces that chain together:

Agentic supply chain compromise — a poisoned plugin, MCP server, prompt template or tool description injects natural-language instructions that alter behavior without touching any binary.
Goal hijacking — adversarial text silently redirects the agent’s terminal objective while it keeps passing plausibility checks (already observed in the wild for product-recommendation poisoning).
Inter-agent trust escalation — a confused-deputy problem induced through natural language, where an orchestrator trusts a sub-agent’s self-asserted identity or permissions.
Computer Use Agent (CUA) visual attack — instructions hidden in pixels the agent screenshots (off-viewport text, low-contrast banners, a fake “Approve” button) that are innocuous to a human reviewer.
Session context contamination — data introduced early in a long session biases later reasoning without any individual step looking anomalous.
MCP / plugin abuse — tool-description poisoning, server-side instruction injection, and cross-server override where a malicious server rewrites the behavior of trusted ones.
Capability / architecture disclosure — the agent reveals tool names, schemas, system-prompt structure, or HitL trigger logic, converting black-box probing into a white-box attack plan.

The HitL-bypass mechanics reported from live engagements are the connective tissue. Red teamers wore down review with consent fatigue (repeated low-stakes prompts), exploited probabilistic invocation (agents that decide when to ask), and used incremental escalation where no single step warranted review but the compound outcome did. A representative chain begins with cross-domain prompt injection (XPIA) in a crafted web-hosted document; capability disclosure then leaks the tool schema; a forged-but-schema-valid consent token slips a high-privilege call past approval. The agent’s pipelines do the rest. This is the architectural lesson behind the lethal trifecta and agents rule of two, now backed by twelve months of operational receipts.

Why it matters

The value here is the shift from prediction to evidence. v1.0 hypothesized HitL bypass and memory poisoning; v2.0 reports that XPIA and memory poisoning showed up at high frequency and frequently combined, that cross-domain prompt injection remained the most reliable initial-access vector, and that session contamination and incremental escalation were both highly effective and hard to detect because no individual step is anomalous.

Two consequences stand out for defenders. First, model-level evaluation will not surface these: zero-click chains, inter-agent trust escalation and session contamination only appear under system-level testing across complete task flows. Second, capability disclosure is the pivot — in many of AIRT’s highest-impact chains, execution depended on extracting architecture details first, which often required only asking the agent directly.

Defenses

The v2.0 mitigations are architectural, and AIRT is explicit that they are difficult to retrofit. The mandatory short list:

Treat agentic components as a software supply chain. Generate an SBOM that includes plugins, MCP servers, prompt templates and tool descriptions; require signature/provenance verification before install; scan registries for hidden instructions in tool descriptions, not just for malicious code; pin versions, since even a patch bump can change natural-language tool behavior.
Zero-trust inter-agent architecture. Issue each agent an attestable credential at provisioning; bind it to every message and tool call; verify the credential chain before any privilege decision. Never let a sub-agent elevate itself via self-description.
Harden the consent architecture — UX is a security control. Make HitL invocation deterministic (the agent must not decide when to ask), decompose compound actions before approval, summarize approval prompts from the underlying tool calls rather than the agent’s own description (to stop description laundering), tier approvals by reversibility and blast radius, and run anomaly detection on approval frequency to catch consent-fatigue exploitation.
Adversarial session hardening. Tag every token with its source (system / user / retrieved / tool / inter-agent), keep trusted and untrusted content structurally separate, monitor for a single retrieved document’s framing amplifying across reasoning steps, bound how much external content can steer a session, and gate sensitive tool calls once untrusted data has entered context.
Disclosure-resistant prompting and output filtering. Refuse tool-list, system-prompt and schema introspection uniformly across all input channels; scan outbound content (including memory writes and inter-agent messages) for schema fingerprints; resolve tool inventories at runtime from a non-disclosable registry; and minimize the privileged surface so a leak is worth less. Pair with visual-injection defenses for CUA surfaces.

Status

Item	Reference	Date	Notes
Taxonomy v2.0 announced	Microsoft Security Blog	2026-06-04	Grounded in 12 months of red-team engagements
v2.0 whitepaper	Microsoft AI Red Team	dated April 2026	7 new failure modes, 5 new mitigation families
Headline finding	HitL bypass	—	Most consistently exploited; zero-click chains observed
New failure modes	Supply chain, goal hijacking, inter-agent trust escalation, CUA visual attack, session contamination, MCP/plugin abuse, capability disclosure	—	Integrated into v1.0 structure, labeled [New in v2.0]
Industry alignment	OWASP ASI, CSA, MITRE SAFE-AI, NIST AI 600-1, CoSAI	—	Cross-referenced, not dependent on any one
v1.0 baseline	Taxonomy of Failure Modes v1.0	2025-04	Forward-looking predecessor

The right takeaway is not a new exploit but a calibration: a year of red teaming confirms that the durable defenses for agents are architectural — supply-chain provenance, cryptographic agent identity, deterministic and tiered consent, source-tagged context — and that the single most reliable way attackers reach high-impact outcomes is by quietly bypassing the human who was supposed to be in the loop.