RESEARCH MEDIUM NEW

Optimus: scoring jailbreaks beyond pass/fail reveals a stealth-optimal regime

A May 9, 2026 arXiv paper argues that binary attack success rate (ASR) hides the jailbreaks defenders should fear most. Its Optimus metric scores prompts on semantic similarity and harmfulness, exposing a 'stealth-optimal' band where ASR collapses toward zero.

2026-06-05 // 7 min // affects: aligned-llms, llamaguard, promptguard, wildguard

What is this?

On May 9, 2026, researchers from the University of Texas at El Paso, Southern Illinois University Carbondale and the University of Illinois Urbana-Champaign (Ismail Hossain, Tanzim Ahad, Md Jahangir Alam, Sai Puppala, Syed Bahauddin Alam and Sajedul Talukder) published The Art of the Jailbreak on arXiv (cs.CR, 2605.09225). The paper’s argument is a measurement one: the field evaluates jailbreaks almost entirely with a binary attack success rate (ASR) — did the model produce harmful output, yes or no — and that single bit throws away the information defenders most need.

Their answer is Optimus, a continuous, training-free score, plus a corpus of 114,000 compositional jailbreak prompts built to study it. We are covering the evaluation contribution, not the prompt corpus: the defensible, durable finding here is that a pass/fail lens is structurally blind to a class of “quiet” jailbreaks.

How it works

Optimus scores a jailbreak prompt on two axes at once, written as J(S, H):

  • S — semantic similarity between the jailbreak prompt and the original harmful seed request. High S means the rewrite still asks the same thing.
  • H — harmfulness probability of the jailbreak output itself, estimated by a harmfulness classifier.

The two are combined through calibrated penalty functions into one continuous number, with no task-specific training required. Optimus uses off-the-shelf embedding and entailment models (the authors’ best pair is all-mpnet-base-v2 × deberta-large-mnli) rather than a fine-tuned judge that must be retrained as attacks evolve. That training-free property is the point: a binary judge or a bespoke classifier starts to age as soon as the attack distribution shifts, whereas a similarity-plus-harmfulness score built on general-purpose models has nothing attack-specific to retrain.
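As a rough illustration of what a two-axis score looks like, the sketch below computes S as a cosine similarity between two embedding vectors and folds S and H into one number. The article does not give the paper's calibrated penalty functions, so `joint_score`, its Gaussian penalty shape, and the `s_peak`, `h_peak` and `width` parameters are all assumptions made for this example, not the authors' formula.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # undefined for a zero vector; treat as "no similarity"
    return dot / (norm_a * norm_b)

def joint_score(s, h, s_peak, h_peak, width=0.25):
    """Hypothetical stand-in for J(S, H).

    Each axis gets a Gaussian penalty that equals 1.0 at the chosen peak
    and decays with distance from it; the two penalties are multiplied.
    The real Optimus penalty functions are not reproduced here.
    """
    penalty_s = math.exp(-((s - s_peak) ** 2) / (2 * width ** 2))
    penalty_h = math.exp(-((h - h_peak) ** 2) / (2 * width ** 2))
    return penalty_s * penalty_h
```

In practice the vectors passed to `cosine_similarity` would come from an embedding model such as the all-mpnet-base-v2 model named above, and H from whatever harmfulness estimator the evaluation uses; both are left outside the sketch so it runs without downloads.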

To produce something to measure, the authors applied 912 in-the-wild composing strategies to 125 harmful seed prompts from JailBreakV-28K (912 × 125 = 114,000 prompts, the corpus size quoted above), and labelled every resulting prompt into one of 14 cyberattack categories (malware, phishing, privilege escalation, data exfiltration, and so on) using a six-model majority vote. No exploit prompts are reproduced here; the canonical reference is the paper.
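The corpus arithmetic and the labelling step can be sketched in a few lines. The counts come from the article; `majority_label` and its tie-handling (returning a fallback label when the top two categories draw) are assumptions, since the article does not say how the authors resolved split votes.

```python
from collections import Counter

N_STRATEGIES = 912   # in-the-wild composing strategies
N_SEEDS = 125        # harmful seed prompts from JailBreakV-28K
CORPUS_SIZE = N_STRATEGIES * N_SEEDS  # 114,000 prompts

def majority_label(votes, fallback="unresolved"):
    """Pick the category most labellers agree on.

    `votes` is a list of category names, one per labelling model.
    Ties between the top two categories return `fallback`; this
    tie rule is an assumption of the sketch.
    """
    if not votes:
        return fallback
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return fallback
    return ranked[0][0]
```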

The headline result is a “stealth-optimal” regime. Plotting prompts in the (S, H) plane, the most dangerous ones cluster near S* ≈ 0.57, H* ≈ 0.43 — rewrites that preserve enough of the original intent to still be useful to an attacker, while staying sanitized enough on the surface to slip past filters. In exactly that band, binary ASR collapses toward zero: the attack works, but a pass/fail evaluator records a “failure” because the output doesn’t trip the coarse harmful-content check. The metric a team trusts is blindest precisely where the risk concentrates.
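A small sketch makes the blind spot concrete. It assumes a pass/fail judge that counts an attack as successful only when H reaches a threshold of 0.5, and it defines the stealth-optimal band as a disc of radius 0.1 around the reported centre; both the threshold and the radius are assumptions chosen for illustration, and the sample points are made up.

```python
import math

def binary_asr(h_scores, threshold=0.5):
    """Fraction of prompts a pass/fail judge records as successful attacks."""
    if not h_scores:
        return 0.0
    return sum(h >= threshold for h in h_scores) / len(h_scores)

def in_stealth_band(s, h, s_star=0.57, h_star=0.43, radius=0.1):
    """True if (s, h) lies within `radius` of the reported cluster centre."""
    return math.hypot(s - s_star, h - h_star) <= radius

# Made-up (S, H) points: three near the band, one loud and obviously harmful.
points = [(0.60, 0.40), (0.55, 0.45), (0.58, 0.41), (0.20, 0.90)]
band = [(s, h) for s, h in points if in_stealth_band(s, h)]
asr_all = binary_asr([h for _, h in points])
asr_band = binary_asr([h for _, h in band])
```

With these inputs the judge counts only the loud prompt, so `asr_band` is 0.0 even though three of the four prompts sit in the band the article describes as most dangerous.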

Why it matters

Most production LLM defenses lean on lightweight classifiers — LlamaGuard, PromptGuard, WildGuard and similar — sitting in front of an RLHF-aligned model. The paper’s threat model is the realistic one: a black-box, single-turn attacker who can iterate offline against local copies, embedding models and harmfulness estimators before sending a single polished prompt. Against that adversary, the authors’ category-aware generators reach perplexity 24–39 (versus 40–140 for AutoDAN and AmpleGCG — lower perplexity means more fluent, less anomalous-looking text) with measured safety-filter evasion on LlamaPromptGuard-2-86M.
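For readers unfamiliar with the perplexity figures, the standard definition is the exponential of the negative mean log-probability per token. The sketch below assumes natural-log token probabilities from some reference language model; which model the authors used to compute their numbers is not stated in this article.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a text from its per-token natural-log probabilities.

    Lower values mean the reference model finds the text more predictable,
    which is why fluent rewrites score lower than token-optimized strings.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(-mean_logprob)
```

A text whose every token has probability 0.25 under the reference model has perplexity 4, since the model is on average as uncertain as a uniform choice among four options.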

Two consequences for defenders. First, if your red-team scorecard is ASR, you are over-reporting your own safety. The jailbreaks that score “blocked” include the stealth-optimal ones that actually succeeded. Second, category-level scoring changes where you spend effort. Optimus gives a per-attack-class breakdown — which strategies are most effective against phishing prompts versus privilege-escalation prompts — so hardening can target the categories where the model is genuinely weakest, instead of an undifferentiated “jailbreak resistance” number. This is the same critique robustness surveys have aimed at the field: evaluation that mixes attack form with threat semantics tells you little about real exposure.
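The aggregate-versus-category gap is easy to reproduce. The sketch below is a minimal breakdown over made-up results, assuming each record is a hypothetical `(category, breached)` pair; it is not the paper's tooling.

```python
from collections import defaultdict

def breach_rates(results):
    """Return (overall_rate, {category: rate}) from (category, breached) pairs."""
    totals = defaultdict(int)
    breaches = defaultdict(int)
    for category, breached in results:
        totals[category] += 1
        breaches[category] += int(breached)
    per_category = {c: breaches[c] / totals[c] for c in totals}
    n = sum(totals.values())
    overall = sum(breaches.values()) / n if n else 0.0
    return overall, per_category

# Made-up data: five categories of 100 prompts each, all breaches in one class.
results = []
for category in ["malware", "phishing", "data exfiltration", "ransomware"]:
    results += [(category, False)] * 100
results += [("privilege escalation", True)] * 40
results += [("privilege escalation", False)] * 60

overall, per_category = breach_rates(results)
```

Here the overall breach rate is 8 percent, which reads as "92% blocked", while the privilege-escalation class alone is breached 40 percent of the time.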

Defenses

The paper is itself a defensive instrument — better measurement — but it implies concrete practice changes.

  1. Stop reporting jailbreak resistance as a single ASR number. Pair it with a continuous, two-dimensional score (similarity to intent × harmfulness) so your evaluation can see the stealth-optimal band that pass/fail hides.

  2. Score by attack category, not in aggregate. Break results down by concrete objective (malware, phishing, privilege escalation, data exfiltration) and prioritize the categories with the worst per-class scores. To take a hypothetical example, an aggregate “92% blocked” can conceal a 40% breach rate in one category.

  3. Test with fluent, compositional rewrites — not just templates. Defenses tuned against handcrafted DAN-style templates or token-optimization attacks will miss low-perplexity, semantically-reframed prompts. Include in-the-wild composing strategies in your red-team set.

  4. Don’t rely on surface-level content classifiers alone. A filter that keys on lexical harm signals is exactly what the stealth-optimal regime defeats. Layer representation- or activation-based detection that inspects internal state, not just output strings.

  5. Re-evaluate continuously. Because Optimus needs no retraining, it can run as a standing metric in CI against each model update — catching regressions where a new checkpoint quietly becomes easier to jailbreak in one category.
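The continuous re-evaluation step could be wired into CI along these lines. This is a sketch under stated assumptions: per-category scores are plain floats where higher means easier to jailbreak, `find_regressions` is a hypothetical helper, and the 0.02 tolerance is an arbitrary placeholder rather than a recommended value.

```python
def find_regressions(baseline, candidate, tolerance=0.02):
    """Compare per-category scores between two checkpoints.

    Returns {category: (old, new)} for every category whose score rose by
    more than `tolerance`. Categories absent from the baseline are skipped
    because there is nothing to compare against.
    """
    regressions = {}
    for category, new in candidate.items():
        old = baseline.get(category)
        if old is not None and new - old > tolerance:
            regressions[category] = (old, new)
    return regressions

# Made-up scores for two checkpoints.
baseline = {"phishing": 0.10, "malware": 0.05}
candidate = {"phishing": 0.11, "malware": 0.20, "ransomware": 0.30}
regressed = find_regressions(baseline, candidate)
```

A CI job would fail the build when `regressed` is non-empty, which flags the malware category here while ignoring the small phishing drift.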

Status

| Item | Reference | Date | Notes |
| --- | --- | --- | --- |
| The Art of the Jailbreak | arXiv:2605.09225v1 (cs.CR) | 2026-05-09 | Optimus scoring + 114k compositional prompt corpus |
| Optimus metric | Paper | 2026-05-09 | Training-free J(S,H); stealth-optimal regime S*≈0.57, H*≈0.43 |
| Generators | Paper | 2026-05-09 | Perplexity 24–39 vs 40–140 (AutoDAN/AmpleGCG); evasion measured on LlamaPromptGuard-2-86M |
| Scope | Paper | 2026-05-09 | 912 composing strategies × 125 seeds (JailBreakV-28K), 14 cyberattack categories |

The framing to take away is not “jailbreaks are unstoppable” — it is that the way most teams measure jailbreak resistance systematically under-counts the attacks that matter. A continuous, category-aware score that captures both semantic intent and harmfulness gives defenders a map of where their model actually breaks, which a single success-rate bit cannot.

Sources