GOVERNANCE MEDIUM NEW

No two labs measure prompt injection the same way

A June 1, 2026 comparison of the prompt-injection disclosures from Anthropic, OpenAI, Google and Meta found that no two labs share a metric, a surface, or a definition of success — so vendor numbers cannot be compared.

2026-06-05 // 6 min affects: claude-opus-4.8, chatgpt-atlas, gemini-3-pro, llama-guard

What is this?

On June 1, 2026, VentureBeat published a side-by-side comparison of the prompt-injection disclosures that Anthropic, OpenAI, Google and Meta released during spring 2026. The finding is not a new attack — it is a measurement problem. No two of the four labs measure prompt injection the same way. They test different surfaces, define “success” differently, and report at different layers of the stack, so a buyer cannot place their headline numbers side by side.

This matters because prompt injection is now the headline risk for agentic AI, and 2026 is the first year frontier labs have voluntarily printed failure rates at all. The catch, as a second writeup on June 1 put it, is that “a model showing a low injection rate under one lab’s definition may face higher exposure under another lab’s test design.” Transparency arrived before standardization.

How it works

The four disclosures diverge on three axes: how many surfaces were tested, where the measurement was taken, and what counts as a successful injection.

Anthropic put the most on the table: a 244-page Claude Opus 4.8 system card on May 28, 2026 covering four agentic surfaces (browsing, coding, agent-to-agent coordination, tool use). Its browser agent was reported hijacked 31.5% of the time before safeguards engaged, dropping to roughly 1% with defenses on (see our note on the Opus 4.8 browser-agent hijack rate).
OpenAI reported essentially one surface — connectors — and has repeatedly framed the problem as unbounded, saying prompt injection is unlikely to ever be fully “solved” for browser agents like Atlas (Fortune, Dec 2025).
Google moved the subject out of its model card and into a separate safety framework, with no published per-surface success rate.
Meta shipped no closed-model card and graded its guardrails rather than the model itself.

Lab        Surfaces tested     Measurement layer      "Success rate" given?
---------  ------------------  ---------------------  ---------------------
Anthropic  4 (agentic)         pre- AND post-safeguard  Yes — per surface
OpenAI     1 (connectors)      product-level            Partial
Google     n/a in model card   separate framework       No per-surface rate
Meta       guardrail-only      guardrail layer          Grades guardrail, not model

The result is that a “31.5%” from one lab and a “low rate” from another are not the same unit. One is a pre-mitigation model property; another is a post-mitigation product property; a third is a guardrail score. There is no shared adversarial test suite, no common threat model, and no agreed definition of a “hijack.” VentureBeat’s framing is apt: the gap mirrors software vulnerability disclosure before CVE — useful raw signals with no interoperable schema to compare them.

Why it matters

For a security team evaluating agents for production, the practical consequence is that you cannot procure on headline numbers. A lower advertised figure can reflect a narrower test, a later measurement layer, or a friendlier definition — not a safer model. Comparing them directly produces a false ranking.

It also distorts incentives. A lab that tests four surfaces and prints both pre- and post-safeguard rates looks “worse” on a naive read than a lab that grades only its guardrail and reports one tidy number. Rewarding the second behavior over the first pushes the whole field toward less disclosure, not more — the opposite of what defenders need. This is a governance problem, not a model bug, and it is the kind of thing standards bodies (NIST AI RMF, OWASP’s LLM Top 10, MITRE ATLAS) exist to fix. As of this writing, no regulator has mandated a common reporting format for agent vulnerabilities; the four disclosures are voluntary.

Defenses

You cannot patch a measurement gap, but you can stop being misled by it.

Never compare headline rates across vendors. Treat each lab’s number as valid only within its own methodology. A 31.5% pre-safeguard model rate and a “low” guardrail score are different units — refuse to rank them against each other.
Demand the methodology, not the number. Before deploying an agent in a sensitive workflow, request: which surfaces were tested, whether the rate is pre- or post-mitigation, the definition of a successful injection, and the test corpus. If a vendor will not share it, treat the headline figure as marketing.
Normalize to your own surfaces. Map each disclosure onto the surfaces you actually expose — browser, code execution, tool/connector calls, agent-to-agent. A model’s connector number is irrelevant if your deployment only uses browsing, and vice versa.
Run your own injection tests at the post-mitigation, product layer. Vendor pre-safeguard rates describe the raw model; what you ship is the model plus your guardrails, system prompt and tool scoping. Re-measure on your stack with a fixed corpus you control, and re-run it on every model upgrade.
Adopt a shared framework internally now. Until an industry standard lands, pick one reference taxonomy (OWASP LLM01, MITRE ATLAS) and require every vendor disclosure and internal test to be re-expressed in it. That gives you an apples-to-apples sheet even when the sources are apples-to-oranges.
Assume the ceiling, not the floor. Both OpenAI and independent researchers describe prompt injection as a durable, possibly unsolvable class. Design for the case where the agent will be injected — least privilege, human confirmation on sensitive actions, no lethal trifecta — rather than trusting any single published rate.

Status

Lab	Disclosure	Date	What it reports
Anthropic	Claude Opus 4.8 system card (244 pp.)	2026-05-28	4 agentic surfaces; browser 31.5% pre-safeguard, ~1% post
OpenAI	Connector / Atlas guidance	Spring 2026	One surface; frames injection as not fully solvable
Google	Separate safety framework	Spring 2026	No per-surface success rate in model card
Meta	Guardrail evaluation	Spring 2026	Grades guardrail, not the underlying model
VentureBeat	Cross-lab comparison	2026-06-01	No shared metric, surface, or success definition

The right takeaway is not “lab X is safest.” It is that the industry has started publishing prompt-injection numbers faster than it has agreed on what they mean — and until there is a CVE-like common schema for agent disclosures, the comparison work falls on the buyer. Ask for the methodology, normalize to your own surfaces, and measure on your own stack.