Measuring LLM exploit capability: ExploitBench, ExploitGym and the SCONE-bench refresh
On May 22, 2026 Anthropic published Mythos Preview results on three new exploitation benchmarks. The numbers — and the way the benchmarks decompose the exploit chain — change how defenders should think about frontier offensive capability.
What is this?
On May 22, 2026, Anthropic published “Measuring LLMs’ ability to develop exploits” on red.anthropic.com, reporting Claude Mythos Preview’s results on three new exploitation benchmarks: ExploitBench, ExploitGym and an updated SCONE-bench. The post is a companion to Project Glasswing: instead of counting vulnerabilities found in shipping software, it tries to measure how far up the exploit-development chain current frontier models can climb.
The benchmarks themselves are the news. Two were posted to arXiv in May 2026 by external groups: “ExploitBench: A Capability Ladder Benchmark for LLM Cybersecurity Agents” by Seunghyun Lee (CMU) and David Brumley (CMU / Bugcrowd), and “ExploitGym: Can AI Agents Turn Security Vulnerabilities into Real Attacks?” by a UC Berkeley / Max Planck / UCSB / Arizona State consortium with contributors from Anthropic, OpenAI and Google. The third, SCONE-bench, is an Anthropic-led smart-contract exploitation benchmark whose harness and dataset are now open-sourced on GitHub.
How it works
Each benchmark targets a different layer of the exploit chain. None of them publishes weaponisable payloads — the whole point is to score, programmatically, how close to a working exploit a model can get on already-patched bugs.
ExploitBench: V8 capability ladder. ExploitBench decomposes exploit development into 16 measurable capabilities grouped into five tiers, evaluated against 41 patched V8 JavaScript-engine CVEs. The tiers run from T5 (lowest rung) to T1 (highest):
| Tier | Name | What the model must do |
|---|---|---|
| T5 | Coverage | Reach the vulnerable code path |
| T4 | Reproduction | Trigger the bug (proof-of-concept) |
| T3 | Target primitives | Build primitives inside the V8 sandbox |
| T2 | Generic primitives | Break the sandbox: cross-process read/write/infoleak |
| T1 | Full Control | Hijack control flow / arbitrary code execution (ACE) |
Each capability is verified automatically: lower tiers by differential execution against the patched build, higher tiers by challenge-response functions replayed across randomised heap layouts, so a model cannot pass by hard-coding a leaked address. All models run on an identical 300-turn harness. Lee and Brumley report that current public frontier LLMs reach T5 / T4 routinely but stall at T3. Per Anthropic’s measurement, Claude Mythos Preview reaches T1 (ACE) on 21 of 41 CVEs, while no other tested model achieves a single ACE in either harness variant (listed as Baseline and Nudged in the Status table).
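The challenge-response idea can be illustrated with a toy model. This is a sketch under stated assumptions, not ExploitBench's verifier: memory is a plain dictionary, and `verify_read_primitive`, `SECRET_OFFSET`, `honest` and `hardcoded` are hypothetical names invented for the example. It shows why replaying a check across randomised layouts defeats a hard-coded address.

```python
import random
from typing import Callable, Optional

# Hypothetical toy model: "memory" is a dict and the layout base moves per trial.
Leak = Callable[[], int]
Read = Callable[[int], Optional[int]]

SECRET_OFFSET = 0x40  # illustrative offset of the challenge value from the base


def verify_read_primitive(candidate: Callable[[Leak, Read], Optional[int]],
                          trials: int = 8, seed: int = 1234) -> bool:
    """Pass only if the candidate returns the fresh challenge under every layout."""
    rng = random.Random(seed)
    for _ in range(trials):
        base = rng.randrange(0x10000, 0x7FFF0000, 0x1000)  # randomised layout
        challenge = rng.getrandbits(32)                     # fresh value per trial
        memory = {base + SECRET_OFFSET: challenge}
        if candidate(lambda: base, memory.get) != challenge:
            return False
    return True


def honest(leak: Leak, read: Read) -> Optional[int]:
    # Derives the address from a leak performed at run time.
    return read(leak() + SECRET_OFFSET)


def hardcoded(leak: Leak, read: Read) -> Optional[int]:
    # Replays an address observed in one earlier run.
    return read(0x20000 + SECRET_OFFSET)


print(verify_read_primitive(honest))     # True
print(verify_read_primitive(hardcoded))  # False
```

The hard-coded candidate fails because the address it memorised is only valid for one layout, while the verifier demands a correct answer under every layout it draws.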
ExploitGym: broad target set with end-to-end flags. ExploitGym takes 898 patched vulnerabilities from OSS-Fuzz, V8 and the Linux kernel and packages each as a containerised remote target. A run is graded by flag capture, and it counts as an intended success only when a model judge agrees that the intended vulnerability was exploited rather than an incidental easier bug. The wall-clock budget is two hours, and models run inside their developers’ recommended harnesses. Anthropic reports Mythos Preview at 157 successes using the intended vulnerability (226 total flag captures), with Claude Opus 4.6 at 15 intended / 36 total. The ExploitGym preprint reports OpenAI’s GPT-5.5 at 120 working exploits on the same set.
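The difference between the "intended" and "total" counts can be sketched as follows. This is a minimal illustration of the rule as described above, not ExploitGym's grader: the `Attempt` record, its field names and the target IDs are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    target_id: str             # hypothetical identifier for a containerised target
    flag_captured: bool        # did the agent retrieve the flag?
    judge_says_intended: bool  # did the model judge attribute it to the intended bug?


def tally(attempts: list[Attempt]) -> tuple[int, int]:
    """Return (intended successes, total flag captures)."""
    total = sum(a.flag_captured for a in attempts)
    intended = sum(a.flag_captured and a.judge_says_intended for a in attempts)
    return intended, total


attempts = [
    Attempt("target-1", flag_captured=True, judge_says_intended=True),
    Attempt("target-2", flag_captured=True, judge_says_intended=False),   # easier side bug
    Attempt("target-3", flag_captured=False, judge_says_intended=False),
]
print(tally(attempts))  # (1, 2)
```

Reporting both numbers, as in "157 intended / 226 total", keeps incidental flag captures visible without crediting them to the vulnerability under test.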
SCONE-bench refresh. The smart-contract benchmark was refreshed with 12 exploits from the DefiHackLabs dataset, all reported after the January 1, 2026 model knowledge cut-off. Performance is the summed historical USD value of contracts successfully drained in local simulation, log-scaled. Anthropic reports Mythos Preview at ~$35M of simulated exploit value, ~75% above the next-closest model, and an updated doubling time of roughly 0.7 months for Claude models’ simulated exploit revenue since Opus 4.5 (down from the 1.1-month trajectory tracked since 2024).
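To make the doubling-time figures concrete, the arithmetic can be written out. The sketch below assumes pure exponential growth, which is a simplification for illustration and not a forecast; `growth_factor` is a hypothetical helper name.

```python
def growth_factor(months: float, doubling_time_months: float) -> float:
    """Multiplicative growth over `months`, assuming a constant doubling time."""
    return 2.0 ** (months / doubling_time_months)


# Compare the two doubling times quoted above over a single month.
for doubling_time in (1.1, 0.7):
    per_month = growth_factor(1.0, doubling_time)
    print(f"doubling every {doubling_time} months -> x{per_month:.2f} per month")
```

Under that assumption, a 0.7-month doubling time corresponds to roughly 2.7x growth per month, against roughly 1.9x per month at 1.1 months.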
Across all three, the same picture appears: a step-change between Opus 4.6/4.7 and Mythos Preview at exactly the layer where exploitation stops being pattern recognition and starts requiring deterministic primitive-building, sandbox escape and chain assembly.
Why it matters
Three implications for defenders follow, none of which requires access to Mythos.
Benchmarks are catching up to capability. Until early 2026 the public cyber benchmarks largely measured “did the model find a crash?” That is the wrong question: a crash is not an exploit, and most LLMs were saturating those benchmarks while still being unable to weaponise anything. ExploitBench’s 16-flag ladder and ExploitGym’s intended-vulnerability flag-capture rule are the first public scoring rubrics that distinguish reachability from exploitability at fine grain. That matters because every threat-model conversation now has a shared scoreboard.
The capability cliff is concrete. ExploitBench data shows that the T3→T2 jump (escaping the V8 heap sandbox) is the cliff: only Mythos Preview crosses it reliably, and only Mythos Preview combines V8 sandbox escape with control-flow hijack. ExploitGym shows the same shape across broader targets, including kernel exploits. Defenders planning around “AI can find bugs but cannot exploit them” need to update: at the private frontier, that is no longer true.
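One way to locate such a cliff in a set of results is to count how many CVEs reach each rung and find the largest drop between adjacent rungs. The sketch below assumes the ladder is cumulative (reaching T2 implies T3 through T5), which the source does not state explicitly. The function names and the CVE labels are hypothetical.

```python
from typing import Optional

LADDER = ["T5", "T4", "T3", "T2", "T1"]  # lowest rung first


def reach_counts(best_tier: dict[str, Optional[str]]) -> dict[str, int]:
    """Count CVEs whose best result is at or above each rung (cumulative ladder assumed)."""
    rank = {tier: i for i, tier in enumerate(LADDER)}
    counts = {tier: 0 for tier in LADDER}
    for best in best_tier.values():
        if best is None:  # the run never reached the vulnerable code
            continue
        for tier in LADDER[: rank[best] + 1]:
            counts[tier] += 1
    return counts


def steepest_drop(counts: dict[str, int]) -> tuple[str, str]:
    """Return the adjacent rung pair with the largest fall in CVE count."""
    pairs = list(zip(LADDER, LADDER[1:]))
    return max(pairs, key=lambda p: counts[p[0]] - counts[p[1]])


# Made-up results for illustration only.
results = {"cve-a": "T3", "cve-b": "T3", "cve-c": "T4", "cve-d": "T1", "cve-e": None}
counts = reach_counts(results)
print(counts)                 # {'T5': 4, 'T4': 4, 'T3': 3, 'T2': 1, 'T1': 1}
print(steepest_drop(counts))  # ('T3', 'T2')
```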
The doubling trend has not plateaued. SCONE-bench’s doubling time for smart-contract exploit value shortened from 1.1 months to 0.7 months, on problems sourced after the model knowledge cut-off. Anthropic itself flagged this as a trend continuing past the point where it had previously expected saturation. The case for assuming next-generation public models will land near today’s private-frontier capability within 6 to 12 months is stronger this month than last.
Defenses
Benchmarks do not patch anything by themselves. They do, however, change how defenders should prioritise.
- Update threat models to assume sandbox escape is in range. Browser, JS-engine and kernel teams that have been resourcing against T4-level adversaries should re-plan against T2-T1 adversaries within the next major release cycle. The Firefox 150 disclosures and the wolfSSL cert-forgery example from the Glasswing update are early datapoints; ExploitBench formalises the scoring.
- Run the benchmarks against the models you actually deploy. ExploitBench and ExploitGym ship as reproducible containerised environments; SCONE-bench is now open-source. Internal red teams can measure exactly how far their own toolchain (open-weights model + harness) gets up the ladder before investing in mitigations elsewhere.
- Push memory-safe migrations in the highest-tier surfaces. Use-after-free, OOB read/write and type confusion remain the high-yield V8 / browser / kernel classes that ExploitBench measures against. Memory-safe rewrites of hot parsers and JIT helpers are the only structural defence; everything else buys time.
- Track capability evals, not just headline launches. Anthropic’s Cyber Verification Program and the External Researcher Access Program give defenders an interface to capability information ahead of release. Equivalent programs at other labs are worth subscribing to.
- Calibrate disclosure capacity. If frontier-model exploit capability is doubling sub-monthly, expect the bug-report volume that hit Mozilla and wolfSSL this month to hit more maintainers next quarter. Pre-allocated CVE blocks, security.txt updates and an AI-assisted-triage policy are no-regret moves.
- Demand benchmark transparency from vendors. “Our model is safe” without a public score on at least one capability-ladder benchmark is no longer adequate. Procurement teams can require ExploitBench / ExploitGym / SCONE-bench scores in security questionnaires.
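For the security.txt item in the list above, a small generator keeps the file's expiry date from going stale. This is a minimal sketch: the `Contact`, `Expires` and `Policy` fields come from RFC 9116, while the function name, the 180-day default and the example.com addresses are placeholders chosen for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def render_security_txt(contact: str, policy_url: str, valid_days: int = 180,
                        now: Optional[datetime] = None) -> str:
    """Render a minimal security.txt body with a refreshed Expires field."""
    now = now or datetime.now(timezone.utc)
    expires = (now + timedelta(days=valid_days)).strftime("%Y-%m-%dT%H:%M:%SZ")
    lines = [
        f"Contact: {contact}",   # required by RFC 9116
        f"Expires: {expires}",   # required by RFC 9116
        f"Policy: {policy_url}", # optional; could link to an AI-assisted-triage policy
    ]
    return "\n".join(lines) + "\n"


# Placeholder values; replace with the organisation's real contact and policy URL.
print(render_security_txt("mailto:security@example.com",
                          "https://example.com/security-policy"))
```

Regenerating the file on a schedule is one way to keep the `Expires` date current; how the policy page treats AI-generated reports is left to the maintainer.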
Status
| Item | Reference | Date | Notes |
|---|---|---|---|
| Anthropic post | Measuring LLMs’ ability to develop exploits | 2026-05-22 | Mythos Preview scored on three benchmarks |
| ExploitBench preprint | Lee, Brumley (CMU / Bugcrowd) — arXiv 2605.14153 | 2026-05 | 41 V8 CVEs, 16-flag capability ladder |
| ExploitGym preprint | Berkeley RDI et al. — arXiv 2605.11086 | 2026-05 | 898 vulns, OSS-Fuzz + V8 + Linux kernel |
| SCONE-bench refresh | Anthropic / MATS / Fellows | 2026-05-22 | 12 post-cutoff DefiHackLabs exploits, open-sourced |
| Headline result | ExploitBench, Baseline+Nudged | 2026-05-22 | Mythos Preview: 21/41 ACE; other models: 0/41 |
| Headline result | ExploitGym, 2h wall-clock | 2026-05-22 | Mythos Preview: 157 intended / 226 total flags |
| Headline result | SCONE-bench refresh | 2026-05-22 | Mythos Preview: ~$35M simulated exploit value |
The benchmarks themselves are the contribution worth tracking. They give the rest of the field — defenders, regulators, procurement, AISI-style evaluators — a vocabulary that distinguishes “model can find bugs” from “model can finish exploits”. The Mythos numbers are a snapshot of one frontier model in May 2026; the scoring infrastructure will keep mattering after the next generation lands.