RESEARCH MEDIUM

When the attacker is another LLM: large reasoning models as autonomous jailbreakers

A paper formally published in Nature Communications in 2026 shows four reasoning models (DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini and Qwen3 235B) jailbreaking nine target LLMs with a 97.14% overall success rate, armed with nothing but a single system prompt.

2026-05-25 // 6 min // affects: gpt-4o, claude-4-sonnet, deepseek-v3, gemini-2.5, grok-3, qwen3, open-weight-reasoning-models

What is this?

The paper Large Reasoning Models Are Autonomous Jailbreak Agents by Thilo Hagendorff, Erik Derner and Nuria Oliver was first posted as an arXiv preprint on August 5, 2025 (arXiv:2508.04039) and formally published in Nature Communications in 2026 (Nat Commun 17, 1435). Coverage of the formal publication has intensified in May 2026, with secondary analyses from redteams.ai and pebblous.ai treating it as the most-cited jailbreak result of the year. The thesis is uncomfortable: four large reasoning models (LRMs) — DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini, Qwen3 235B — given a single system prompt and no further supervision, autonomously jailbreak nine widely deployed target models with an overall success rate of 97.14%.

The authors call the phenomenon alignment regression: improving a model’s reasoning capability simultaneously improves its capability as an attacker against other aligned models. The cost curve of red-teaming, until recently measured in human hours per successful jailbreak, collapses toward zero.

How it works

The experimental setup is deliberately minimal. Each LRM receives one system prompt — a brief description of the role of adversarial evaluator — and a list of harmful prompts drawn from a public benchmark covering several sensitive domains. The LRM is then connected to a target model and runs a multi-turn conversation. There is no human in the loop after the system prompt is set, no payload library, no manual iteration, no gradient-based optimisation. The attacker plans, drafts, sends, observes the refusal, refines, and tries again — using only its own reasoning chain.

The threat model therefore assumes very little of the attacker: black-box access to a target’s API, an off-the-shelf LRM, and a one-paragraph system prompt. No model weights, no architecture knowledge, no specialised tooling. The setup is conceptually closer to PAIR (Chao et al., 2023) than to GCG (Zou et al., 2023), but with a sharper finding: the persuader does not need to be fine-tuned for the role. Stock LRMs are already persuasive enough.

# Conceptual sketch of the attack loop: illustrative, not exploit code.
# The actual paper releases neither payloads nor jailbreak transcripts.
# LRM, LLM, benchmark, MAX_TURNS, ADVERSARIAL_EVALUATOR_PROMPT and
# judged_unsafe are placeholders, not a real API.

attacker = LRM(model="deepseek-r1", system_prompt=ADVERSARIAL_EVALUATOR_PROMPT)
target   = LLM(model="gpt-4o")        # or claude-4-sonnet, gemini-2.5-pro, ...

outcomes = {}                             # harmful_prompt -> True if jailbroken
for harmful_prompt in benchmark:
    history = []
    outcomes[harmful_prompt] = False
    for turn in range(MAX_TURNS):
        attacker_msg = attacker.plan_next(history, goal=harmful_prompt)
        target_msg   = target.respond(history + [attacker_msg])
        history     += [attacker_msg, target_msg]
        if judged_unsafe(target_msg):     # rubric-based evaluator
            outcomes[harmful_prompt] = True
            break                          # successful jailbreak

The asymmetry of results across targets is as informative as the headline number. Per the secondary analyses, Claude 4 Sonnet held the maximum per-condition harm rate to 2.86%, while DeepSeek-V3 sat at the other end with roughly 90%, a 31× gap. The attackers, the prompts and the harness were the same for every target. The analyses attribute the variance to the quality of the target’s safety post-training, not to any obvious capability difference.
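The 31× figure follows directly from the two rates quoted above. A quick check, using only those two numbers (the variable names are illustrative, and 90.0 stands in for the "roughly 90%" in the text):

```python
# Maximum per-condition harm rates as quoted in the secondary analyses.
claude_4_sonnet_rate = 2.86   # percent
deepseek_v3_rate = 90.0       # percent; the text says "roughly 90%"

# Ratio between the weakest and strongest target under the same attack setup.
gap = deepseek_v3_rate / claude_4_sonnet_rate
print(f"gap: {gap:.1f}x")     # about 31x, matching the figure in the text
```

Because the larger rate is only approximate, the ratio should be read as "about 31×" rather than as an exact measurement.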

Why it matters

Three takeaways deserve emphasis, all aligned with the output-filtering (Deep et al., May 2026) and ARGUS (Weng et al., May 2026) results we covered earlier this month.

First, the cost of running a competent adversarial evaluator has fallen to the cost of a single LRM API call per turn. Defenders who relied on red-teaming being expensive — implicitly or explicitly — are now in a different threat landscape. Independent reviewers can stress-test a model the same week it ships.

Second, alignment regression is now an empirical fact, not a thought experiment. The same training that made LRMs better at solving multi-step reasoning problems made them better at constructing multi-turn persuasion plans. There is no published technique that decouples the two capabilities. Frontier labs releasing a reasoning model can expect that model to be turned against their competitors — and against future versions of itself.

Third, the 31× variance across targets is a defender’s lever. The result is reproducible at a small budget and gives concrete signal on which safety-training pipelines transfer under autonomous adversarial pressure. The corollary for anyone procuring a model: ask vendors for their numbers under autonomous-LRM attack, not just under static jailbreak benchmarks.
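Numbers "under autonomous-LRM attack" are easiest to compare when they are broken out per attacker and per target. The sketch below is one possible way to tabulate them, assuming each evaluation run is logged as an (attacker, target, jailbroken) record. The function name `success_rates` and the records are hypothetical placeholders, not data or tooling from the paper.

```python
from collections import defaultdict


def success_rates(records):
    """Aggregate (attacker, target, jailbroken) records into per-pair rates."""
    counts = defaultdict(lambda: [0, 0])  # (attacker, target) -> [successes, trials]
    for attacker, target, jailbroken in records:
        pair = counts[(attacker, target)]
        pair[0] += int(jailbroken)
        pair[1] += 1
    return {key: successes / trials for key, (successes, trials) in counts.items()}


# Made-up placeholder records for illustration only.
records = [
    ("attacker-a", "target-x", True),
    ("attacker-a", "target-x", False),
    ("attacker-a", "target-y", False),
    ("attacker-b", "target-x", True),
    ("attacker-b", "target-y", False),
]

rates = success_rates(records)
for (attacker, target), rate in sorted(rates.items()):
    print(f"{attacker} -> {target}: {rate:.0%}")
```

Keeping the attacker as part of the key makes it harder to hide a weak result behind an average over easier attackers.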

Defenses

The paper itself is a measurement, not a defense. The practical implications for teams shipping LLM-backed products in 2026:

  1. Evaluate under autonomous-LRM attack, not static prompts. Static benchmarks like AdvBench measure 2023 attacks. A defender’s evaluation pipeline should include at least one open-weight LRM running as adversary across a fixed budget of turns.
  2. Treat safety as a system property, not a model property. The independent results from Deep et al. (output filtering) and Weng et al. (provenance-graph auditing) point the same direction: the boundary that survives an adaptive attacker lives outside the model. A target with weak alignment can still be a safe product if its tool registry, output filter and action layer are correctly scoped.
  3. Constrain multi-turn surface area for sensitive use cases. The attacks work over many turns: the target tends to judge each message largely on its own and does not track how the adversarial framing accumulates across the conversation. Application-side conversation policies (turn limits, topic locks, escalation gates) reduce the surface the persuader is allowed to work over.
  4. Track which LRMs your suppliers ship. Open-weight reasoning models that can be self-hosted change the attacker’s economics in ways that closed-weight peers do not. Procurement and security teams should treat a major LRM release as a defender event, not just a capability event.
  5. Do not ship the same safety post-training across model families and assume parity. The 31× variance shows that “we used RLHF” is no longer a sufficient answer about robustness. Ask, and publish, per-attacker numbers.
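The conversation policies in item 3 (turn limits, topic locks, escalation gates) can live entirely in application code. The following is a minimal sketch under stated assumptions: `ConversationPolicy`, its thresholds, and the topic names are hypothetical, and the `topic` and `flagged` inputs are assumed to come from an upstream classifier that is not shown here.

```python
from dataclasses import dataclass


@dataclass
class ConversationPolicy:
    max_turns: int = 8                                   # assumed budget; tune per product
    allowed_topics: frozenset = frozenset({"billing", "shipping"})
    max_flags: int = 2                                   # flagged turns before escalation
    turns: int = 0
    flags: int = 0

    def check(self, topic: str, flagged: bool) -> str:
        """Return 'allow', 'refuse', or 'escalate' for the next user turn."""
        self.turns += 1
        if flagged:
            self.flags += 1
        if self.flags >= self.max_flags:
            return "escalate"                            # escalation gate: hand off to review
        if self.turns > self.max_turns:
            return "refuse"                              # turn limit reached
        if topic not in self.allowed_topics:
            return "refuse"                              # topic lock
        return "allow"


policy = ConversationPolicy(max_turns=3)
decisions = [
    policy.check("billing", flagged=False),
    policy.check("chemistry", flagged=True),
    policy.check("billing", flagged=True),
]
print(decisions)  # ['allow', 'refuse', 'escalate']
```

The policy keeps its counters across the whole conversation, so a persuader cannot reset its standing by switching back to an allowed topic after being flagged.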

Status

| Item | Reference | Date | Notes |
| --- | --- | --- | --- |
| arXiv preprint (v1) | arXiv:2508.04039 | 2025-08-05 | 4 LRMs × 9 targets |
| Nature Communications publication | Nat Commun 17, 1435 | 2026 | DOI 10.1038/s41467-026-69010-1 |
| Secondary analysis: redteams.ai | redteams.ai blog | 2026-05 | Frames alignment regression as cost-curve collapse |
| Secondary analysis: pebblous.ai | pebblous.ai report | 2026 | English and Korean editions |
| Code & data | Mentioned in paper | | Authors describe pipeline; no payload library released |

The deeper signal: the field has crossed a threshold where the most capable adversary against an aligned model is no longer a human red-teamer or a custom optimiser. It is another aligned model. The next twelve months of safety research will be measured against that baseline.

Sources