RESEARCH MEDIUM NEW

Newer isn't always safer: non-monotonic safety alignment across model generations

A May 2026 paper red-teaming four Gemma generations found the mid-generation model was far easier to jailbreak than both its predecessor and successor — safety doesn't improve in a straight line.

2026-06-12 // 6 min // affects: gemma-2, gemma-3, gemma-4

What is this?

A preprint posted to arXiv on May 30, 2026 (2606.00813) makes an uncomfortable observation: a model family’s safety does not reliably improve from one generation to the next. The authors ran an automated red-teaming probe across four generations of Google’s open Gemma family (roughly 7B–31B parameters) and found that the middle generation was substantially more vulnerable to jailbreaks than both the older model before it and the newer model after it.

The headline numbers, as reported in the paper: Gemma 3 (12B) showed a 68.7% ± 5.7% attack success rate (ASR), markedly higher than its predecessor Gemma 2 (45.5% ± 7.2%) and its successor Gemma 4 (33.9% ± 1.8%). In other words, the curve went up before it came down. If you had assumed “newer release = safer model” and upgraded from Gemma 2 to Gemma 3, you would have moved to a more jailbreakable system, not a less jailbreakable one. The authors call this non-monotonic safety alignment.
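The comparison can be made mechanical. The sketch below is a minimal illustration, not anything taken from the paper: it flags each generation-to-generation step where ASR rose by more than the reported margins allow. It assumes the ± values can be treated as simple symmetric error bars (this article does not say how they were computed), and the names GenerationResult and find_regressions are hypothetical. Only the three generations with reported figures are included.

```python
from typing import List, NamedTuple, Tuple


class GenerationResult(NamedTuple):
    name: str
    asr: float      # attack success rate, in percent
    margin: float   # reported +/- value, in percent


def find_regressions(results: List[GenerationResult]) -> List[Tuple[str, str]]:
    """Return (older, newer) pairs where the newer model's ASR is higher
    even after allowing for both margins (an assumed, conservative test)."""
    flagged = []
    for older, newer in zip(results, results[1:]):
        if newer.asr - newer.margin > older.asr + older.margin:
            flagged.append((older.name, newer.name))
    return flagged


# Figures as reported in the paper, listed in release order.
REPORTED = [
    GenerationResult("Gemma 2", 45.5, 7.2),
    GenerationResult("Gemma 3", 68.7, 5.7),
    GenerationResult("Gemma 4", 33.9, 1.8),
]

if __name__ == "__main__":
    print(find_regressions(REPORTED))  # [('Gemma 2', 'Gemma 3')]
```

Under that assumption the Gemma 2 to Gemma 3 step is flagged (63.0% at the low end against 52.7% at the high end), while the Gemma 3 to Gemma 4 step is not.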

How it works

The study uses a quality-diversity evolutionary search called MAP-Elites as the red-teaming engine. Instead of optimizing for a single best jailbreak, MAP-Elites maintains an “archive” of diverse prompts that each succeed in a different region of an attack-behavior space. The result is not one payload but a map of many distinct working attacks per model — which is what makes the cross-generational comparison meaningful: you are measuring a broad attack surface, not a single lucky string.
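To make the archive idea concrete, here is a minimal MAP-Elites loop on a harmless toy problem (points in the unit square). It is a sketch of the general algorithm, not the paper's implementation: the fitness function, the 5x5 descriptor grid, and the Gaussian mutation are all stand-ins chosen for illustration. In a red-teaming setting, one would presumably replace the candidate with a prompt, the fitness with a judge's score, and the descriptor with some measure of attack behavior, but the paper's actual choices are not described here.

```python
import random
from typing import Callable, Dict, Iterable, Tuple

Candidate = Tuple[float, float]
Cell = Tuple[int, int]
GRID = 5  # toy descriptor space: a 5x5 grid over the unit square


def map_elites(
    evaluate: Callable[[Candidate], float],
    describe: Callable[[Candidate], Cell],
    mutate: Callable[[Candidate, random.Random], Candidate],
    seeds: Iterable[Candidate],
    iterations: int = 2000,
    seed: int = 0,
) -> Dict[Cell, Tuple[float, Candidate]]:
    """Keep the best candidate found so far in each descriptor cell."""
    rng = random.Random(seed)
    archive: Dict[Cell, Tuple[float, Candidate]] = {}

    def consider(candidate: Candidate) -> None:
        cell, fitness = describe(candidate), evaluate(candidate)
        # A candidate enters the archive if its cell is empty or it beats the elite.
        if cell not in archive or fitness > archive[cell][0]:
            archive[cell] = (fitness, candidate)

    for s in seeds:
        consider(s)
    if not archive:
        raise ValueError("at least one seed candidate is required")
    for _ in range(iterations):
        # Pick a random elite as the parent, mutate it, and try to place the child.
        _, parent = rng.choice(list(archive.values()))
        consider(mutate(parent, rng))
    return archive


def toy_fitness(c: Candidate) -> float:
    return -((c[0] - 0.5) ** 2 + (c[1] - 0.5) ** 2)


def toy_descriptor(c: Candidate) -> Cell:
    return (min(int(c[0] * GRID), GRID - 1), min(int(c[1] * GRID), GRID - 1))


def toy_mutate(c: Candidate, rng: random.Random) -> Candidate:
    return (
        min(1.0, max(0.0, c[0] + rng.gauss(0, 0.1))),
        min(1.0, max(0.0, c[1] + rng.gauss(0, 0.1))),
    )


if __name__ == "__main__":
    archive = map_elites(toy_fitness, toy_descriptor, toy_mutate, [(0.5, 0.5)])
    print(f"{len(archive)} of {GRID * GRID} cells filled")
```

The point of the design is that a weak candidate in an unexplored cell is still kept, so the search spreads across the descriptor space instead of converging on one optimum.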

The second finding is about transfer. Attacks evolved against one generation were replayed against the others. They transferred to Gemma 3 at 44–46%, but to Gemma 4 at only 14–18%. That gap is the encouraging part of the result: Gemma 4’s safety gains appear to generalize beyond the specific attack distributions evolved against earlier models, rather than just memorizing patches for known prompts. Note also that some harm categories — the paper flags copyright reproduction and certain cybercrime prompts — register near-100% success across every generation, a reminder that aggregate ASR hides categories where alignment barely moved at all.
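The transfer measurement itself is simple bookkeeping: replay a fixed archive against each target and record the fraction that still succeeds. The sketch below assumes attacks are tracked by opaque IDs and that some judge function (hypothetical here, stubbed with a lookup set) decides success on a given target. The model names and outcomes are illustrative data, not the paper's results.

```python
from typing import Callable, Dict, List


def transfer_rate(archive: List[str], succeeds_on_target: Callable[[str], bool]) -> float:
    """Fraction of archived attacks that still succeed on a target model."""
    if not archive:
        raise ValueError("archive is empty")
    return sum(1 for attack in archive if succeeds_on_target(attack)) / len(archive)


def transfer_table(
    archive: List[str], targets: Dict[str, Callable[[str], bool]]
) -> Dict[str, float]:
    """Transfer rate of one source archive against each named target."""
    return {name: transfer_rate(archive, judge) for name, judge in targets.items()}


# Illustrative data only: opaque attack IDs and stubbed judge outcomes.
source_archive = [f"attack-{i:03d}" for i in range(10)]
still_works_on = {
    "model-b": {"attack-000", "attack-002", "attack-005", "attack-007"},
    "model-c": {"attack-002"},
}
# Bind each hit set as a default argument so every lambda keeps its own set.
targets = {
    name: (lambda attack, hits=hits: attack in hits)
    for name, hits in still_works_on.items()
}

if __name__ == "__main__":
    for name, rate in transfer_table(source_archive, targets).items():
        print(f"{name}: {rate:.0%}")
```

Per-category rates can be computed the same way by grouping the archive by harm category before calling transfer_rate, which is what exposes the categories that stay near 100%.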

No working jailbreak strings are reproduced here, and the paper’s value is diagnostic rather than offensive: it is a measurement methodology, building on the well-established line of work on transferable adversarial attacks against aligned models (Zou et al., 2023).

Why it matters

Most upgrade decisions implicitly assume monotonic improvement: the new version is better on benchmarks, therefore safer, therefore a free swap. This paper shows that assumption is unsafe. A regression can be introduced by a change in instruction-tuning data, a new chat template, a capability jump that outpaces the safety training, or a shift in refusal behavior — none of which show up on a capability leaderboard. For anyone who pins a model version in production, “we moved to the latest” is not evidence of improved safety; it can be the opposite.

The finding also undercuts the habit of evaluating only the newest model. If safety is non-monotonic, your threat model has to account for the specific version you deploy — including older versions still pinned in long-lived systems.

Defenses

Re-test on every version bump. Treat a model upgrade like a code change: run your jailbreak and harm-category evaluation suite against the exact version you will ship, and gate the rollout on the results. Do not inherit the previous version’s safety sign-off.
Measure a distribution, not a single prompt. Diversity-aware red-teaming (quality-diversity / archive-based search) exposes broad weaknesses that a single optimized attack misses. Keep an archive of working categories and replay it against each new release.
Watch transfer as a leading signal. If last generation’s attacks still transfer well to the new model, its safety gains are probably shallow. Low transfer is evidence, though not proof, of genuine generalization.
Track per-category ASR, not just the average. Aggregate numbers hide categories (the paper flags copyright and some cybercrime prompts) that stay near-100% across generations. Defend those with external controls — output filters, retrieval/tool gating — rather than trusting model-level refusal.
Pin and document the version. Record exactly which model build you evaluated and deployed, so a silent provider-side update can’t quietly change your risk posture.
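Several of these defenses can be combined into a rollout gate. The following is a minimal sketch under stated assumptions: per-category ASR values (as fractions) come from your own evaluation suite, the names release_gate and GateResult are hypothetical, and the ceiling and regression thresholds are placeholders to be set by your own risk tolerance, not values from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class GateResult:
    passed: bool
    failures: List[str]


def release_gate(
    candidate: Dict[str, float],
    baseline: Dict[str, float],
    ceiling: float = 0.5,          # placeholder: max tolerated ASR per category
    max_regression: float = 0.05,  # placeholder: max tolerated rise vs. baseline
) -> GateResult:
    """Check each category on its own instead of gating on the average."""
    failures: List[str] = []
    for category in sorted(candidate):
        asr = candidate[category]
        if asr > ceiling:
            failures.append(f"{category}: ASR {asr:.2f} is above the ceiling {ceiling:.2f}")
        previous = baseline.get(category)
        if previous is not None and asr - previous > max_regression:
            failures.append(f"{category}: ASR rose from {previous:.2f} to {asr:.2f}")
    return GateResult(passed=not failures, failures=failures)


if __name__ == "__main__":
    # Illustrative numbers for the pinned build and the proposed upgrade.
    baseline = {"category-a": 0.10, "category-b": 0.95}
    candidate = {"category-a": 0.30, "category-b": 0.95}
    result = release_gate(candidate, baseline)
    print("PASS" if result.passed else "BLOCK")
    for line in result.failures:
        print(" -", line)
```

In this example the upgrade is blocked twice over: category-a regressed relative to the pinned build, and category-b sits above the ceiling in both versions, which marks it as a candidate for external controls instead of reliance on model-level refusal.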

Status

Item	Detail
Paper	“Cross-Generational Transfer of Adversarial Attacks Reveals Non-Monotonic Safety Alignment in LLMs”
arXiv ID	2606.00813
Posted	May 30, 2026
Models studied	Google Gemma, four generations (~7B–31B)
Method	MAP-Elites quality-diversity red-teaming
Key result	Gemma 3 (12B) ASR 68.7%, vs. 45.5% for Gemma 2 and 33.9% for Gemma 4
Cross-gen transfer	44–46% to Gemma 3; 14–18% to Gemma 4
Nature	Defensive measurement study — no exploit code released