MultiBreak: 10,389 multi-turn prompts expose how conversational jailbreaks slip past LLM safety
A May 3, 2026 ICML paper releases the largest, most diverse multi-turn jailbreak benchmark to date. It records attack-success-rate gaps of up to 54 points over the previous state of the art on DeepSeek-R1-7B and 34.6 on GPT-4.1-mini — and quantifies how alignment that holds in single turns collapses across follow-ups.
What is this?
On May 3, 2026, Jialin Song, Xiaodong Liu, Weiwei Yang, Wuyang Chen, Mingqian Feng, Xuekai Zhu and Jianfeng Gao posted MultiBreak to arXiv (2605.01687), accepted to ICML 2026. It is a multi-turn jailbreak benchmark — 10,389 adversarial conversations covering 2,665 distinct harmful intents — built to measure how safety-aligned LLMs hold up across natural-looking back-and-forth rather than single weaponised prompts.
The contribution is methodological as much as empirical. Earlier multi-turn datasets were either small or template-driven, so they did not stress models the way real conversational adversaries do. MultiBreak’s authors use an active-learning loop: a generator model is iteratively fine-tuned to produce attack candidates, uncertainty-based refinement selects the strongest, and the pool grows along the axes where the target model is weakest.
Against the second-best published dataset, MultiBreak’s attack success rate (ASR) is +54.0 points higher on DeepSeek-R1-7B and +34.6 points on GPT-4.1-mini. The most informative finding is not the headline ASR but the structural one: intent categories that look safe under single-turn evaluation become substantially more dangerous over several turns.
How it works
Multi-turn jailbreaks share a common shape, sometimes called Crescendo in earlier literature: the attacker begins with benign or research-flavoured questions, builds shared context, then steers the conversation in small steps until the model has implicitly endorsed an unsafe direction. Each step taken in isolation looks fine; the cumulative trajectory does not.
MultiBreak operationalises this idea at scale. The pipeline is, at a high level:
# Conceptual sketch based on the public May 3, 2026 paper.
# No exploit payload against any live system is reproduced.
[ harmful intent ] # 2,665 distinct intents
│
▼
[ generator LLM ] ──► candidate multi-turn dialogue
│
▼
[ target LLM ] ──► response trajectory
│
▼
[ judge / uncertainty ] ──► keep, refine, or discard
│
▼
[ generator fine-tuned on hard cases ] # active-learning loop
│
▼
[ 10,389 multi-turn adversarial prompts, 2,665 intents ]
Two architectural details matter. First, the diversity axis: by unifying many harmful-intent taxonomies (not just the small canonical set used in older benchmarks), the dataset surfaces categories where alignment training is thin. Second, uncertainty-based selection: the loop preferentially keeps dialogues the target model is borderline-confident about, which is where alignment is most fragile and where future small perturbations are most likely to flip the verdict.
This is consistent with independent work earlier in 2025-2026. A Representation Engineering Perspective on the Effectiveness of Multi-Turn Jailbreaks (arXiv 2507.02956) reported that safety-aligned models gradually re-encode Crescendo-style sequences as more benign than harmful as the conversation lengthens — the model’s internal representation of the same content drifts into a safer region of latent space, and the refusal classifier downstream fires less often.
Why it matters
Three reasons MultiBreak is worth taking seriously even though it does not weaponise any specific deployment.
First, it confirms a systematic gap in how safety is evaluated. Almost every public leaderboard reports single-turn ASR — one user message, one judged response. MultiBreak’s gap of tens of percentage points over single-turn baselines means a model can earn a respectable single-turn safety score and still be routinely jailbroken in normal conversational use.
Second, it documents that smaller and reasoning-tuned models are not safer by default. DeepSeek-R1-7B is a strongly reasoning-focused open model; GPT-4.1-mini is a production-grade frontier-class small model. Both show large ASR jumps. Reasoning capability does not, by itself, translate into multi-turn robustness — in some cases it gives the attacker a longer chain to exploit.
Third, the operational implication for anyone shipping LLM features. If your product exposes multi-turn chat — and almost every assistant, copilot, support bot, and RAG interface does — your single-turn red-team report is incomplete by construction. The risk surface is the trajectory, not the prompt.
Defenses
The same research wave that produced MultiBreak has produced concrete mitigations. None is a silver bullet; together they meaningfully raise the cost of multi-turn attacks.
Evaluate on multi-turn benchmarks, not just single-turn. MultiBreak is freely released for research under CC BY 4.0. Run it (or an equivalent like SEMA, MTJ-Bench, X-Boundary) against any model or guardrail you ship. Track trajectory ASR alongside the conventional single-turn ASR; if the delta is large, your alignment is leaking through the conversation.
Carry trajectory-level state in your guardrails. Most production input/output classifiers (Llama Guard 3, ShieldGemma, Prompt Guard, Microsoft Prompt Shields) score each message in isolation. Wrap them in a stateful policy layer that aggregates risk across the session — repeated borderline turns, slow escalation of topic sensitivity, or a sustained drift toward a single harmful intent should compound into a refusal even when each individual message would pass.
Use a Crescendo-aware boundary defense. X-Boundary (arXiv 2502.09990) establishes an explicit safety boundary in representation space and rejects responses that would cross it, regardless of how long the conversation has been steering toward that boundary. It demonstrably reduces ASR against multi-turn attacks without collapsing helpfulness on benign use.
Consider an active honeypot. The Active Honeypot Guardrail System (arXiv 2510.15017) reframes detection: instead of trying to refuse early, it tactically engages a suspected attack trajectory to confirm intent before issuing a hard refusal and logging the session. For products where false positives are expensive, this can outperform pure classifier-based filtering.
Reset context aggressively. Pure architectural mitigations help too. Capping conversation length, summarising and resetting state between turns, and forcing fresh system-prompt re-injection at every turn all remove some of the gradient the attacker is climbing. They cost usability and should be reserved for high-risk surfaces, but they are cheap and they work.
Treat the trajectory as the unit of safety review. This is the architectural conclusion. Most safety evaluation tooling is built around single prompts because that is what fits a leaderboard cell. The threat model is not single prompts. Build the safety case around sessions, score sessions, and red-team sessions.
Status
| Item | Reference | Date | Notes |
|---|---|---|---|
| arXiv submission | MultiBreak v1, arXiv 2605.01687 | 2026-05-03 | ICML 2026 accepted |
| Authors | Song, Liu, Yang, Chen, Feng, Zhu, Gao | — | Academic + Microsoft Research affiliation |
| Benchmark scale | 10,389 multi-turn prompts, 2,665 intents | — | Largest multi-turn dataset to date |
| Best ASR delta | +54.0 pts on DeepSeek-R1-7B; +34.6 pts on GPT-4.1-mini | — | vs. second-best dataset |
| License | CC BY 4.0 | 2026-05-03 | Free to use for research and evaluation |
| Companion defenses | X-Boundary (arXiv 2502.09990), Honeypot Guardrail (arXiv 2510.15017), Representation Engineering (arXiv 2507.02956) | 2025-2026 | Multi-turn-aware mitigations |
| OpenReview discussion | openreview.net/forum?id=uJgfj5EJ2W | 2026 | Peer review record |
Multi-turn jailbreaking is no longer an exotic technique. It is the dominant mode of bypass against current safety-aligned models, and the evaluation infrastructure is finally catching up. If your safety story stops at single-prompt refusal rates, this paper is the prompt to extend it.