Why jailbreaks transfer between models — and how salting fights back
A study of 20 open-weight models finds jailbreak transfer comes from shared internal representations, not safety-training quirks. A defense called LLM salting rotates the refusal direction to break reuse.
What is this?
A jailbreak crafted against one model often works against others, a property called transferability. The study Jailbreak Transferability Emerges from Shared Representations (Rico Angell, Jannik Brinkmann, He He; arXiv:2506.12913, first posted June 2025, revised October 28, 2025) tested 20 open-weight models against 33 jailbreak attacks, each applied to 313 harmful prompts. It argues that transfer is not an artifact of safety training or of model families but a consequence of how models internally encode language. The practical implication is uncomfortable: an attacker can precompute one jailbreak and reuse it across many deployments, the same economics as a password rainbow table. A countermeasure presented by Sophos X-Ops at CAMLIS 2025, “LLM salting” (Sophos blog), is designed to break exactly that reuse, much as a per-password salt defeats a rainbow table.
How it works
The transfer study isolates two factors that systematically shape whether an attack carries from one model to another: (1) how similar two models’ internal representations are on benign prompts, and (2) how strong the jailbreak is on the source model. The causal test is the convincing part — distilling a source model on only the benign responses of a target model, with no attack data at all, increases representational similarity and measurably raises transfer. The qualitative pattern is consistent: persona-style attacks (“you are an unrestricted assistant…”) transfer far more often than cipher-based prompts, because natural-language attacks ride the shared representation space while cipher tricks exploit per-model quirks that do not generalize.
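One way to make factor (1) concrete is to compare the activations two models produce on the same benign prompts. The sketch below uses linear centered kernel alignment (CKA), a common similarity measure chosen here as an assumption; this article does not state which metric the study uses. The activation values are made-up toy numbers, and `linear_cka` and its helpers are illustrative names, not code from the paper.

```python
import math

def center_columns(rows):
    """Subtract each feature's mean so similarity ignores constant offsets."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def gram(rows):
    """Prompt-by-prompt matrix of dot products between activation vectors."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]

def frob_inner(a, b):
    """Sum of elementwise products of two equally sized matrices."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def linear_cka(acts_a, acts_b):
    """Linear CKA between two activation sets over the same prompts (rows)."""
    ka = gram(center_columns(acts_a))
    kb = gram(center_columns(acts_b))
    return frob_inner(ka, kb) / math.sqrt(frob_inner(ka, ka) * frob_inner(kb, kb))

# Toy activations: 4 benign prompts (rows), hidden size 2 (made-up numbers).
model_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
# Same geometry as model_a, rotated by 90 degrees: (x, y) -> (-y, x).
model_b = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0], [0.0, 2.0]]
# A model that organizes the same prompts along an unrelated axis.
model_c = [[1.0], [-1.0], [1.0], [-1.0]]

print("A vs B:", linear_cka(model_a, model_b))
print("A vs C:", linear_cka(model_a, model_c))
```

A score near 1 means the two models arrange the same prompts in the same relative geometry even when the individual coordinates differ, which is the kind of similarity the study links to higher transfer. Note that the two activation sets may have different hidden sizes; only the prompts (rows) must match.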
The defensive side builds on Arditi et al., Refusal in Language Models Is Mediated by a Single Direction (arXiv:2406.11717): a single linear “refusal direction” in activation space largely governs whether a model declines a request. LLM salting adds a fine-tuning loss term that penalizes alignment with this precomputed refusal direction on harmful prompts, rotating the direction so the model “refuses differently.” No payloads are reproduced here, because the technique is a defensive fine-tuning recipe, not an attack. In the Sophos experiments, salting was applied at the layers most aligned with the refusal direction (L = {16, 17, 18, 19, 20} on the 7B models studied).
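A minimal sketch of the two ingredients described above, under stated assumptions. The refusal direction is estimated as the difference of mean activations on harmful versus harmless prompts, following the difference-in-means idea in Arditi et al. The salting term is modeled here as the mean squared cosine similarity between harmful-prompt activations and that direction. The exact loss used by Sophos is not given in this article, so `alignment_penalty`, `salted_loss`, the `weight` parameter, and all numbers are illustrative.

```python
import math

def mean_vector(vectors):
    """Elementwise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector from the harmless mean toward the harmful mean."""
    diff = [a - b for a, b in zip(mean_vector(harmful_acts),
                                  mean_vector(harmless_acts))]
    length = norm(diff)
    return [x / length for x in diff]

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

def alignment_penalty(harmful_acts, direction):
    """Assumed salting term: mean squared cosine with the fixed direction."""
    return sum(cosine(h, direction) ** 2 for h in harmful_acts) / len(harmful_acts)

def salted_loss(task_loss, harmful_acts, direction, weight=1.0):
    """Ordinary fine-tuning loss plus the weighted alignment penalty."""
    return task_loss + weight * alignment_penalty(harmful_acts, direction)

# Toy activations at one layer (made-up numbers, hidden size 2).
harmless = [[1.0, 1.0], [1.0, -1.0]]
harmful_before = [[3.0, 1.0], [5.0, -1.0]]   # base model
harmful_after = [[1.0, 3.0], [-1.0, 5.0]]    # hypothetical salted model

direction = refusal_direction(harmful_before, harmless)  # precomputed once
print("direction:", direction)
print("penalty before:", alignment_penalty(harmful_before, direction))
print("penalty after:", alignment_penalty(harmful_after, direction))
```

The direction is computed once on the base model and then held fixed. Minimizing the penalty during fine-tuning pushes harmful-prompt activations away from that axis, which is one way to read the "rotation" described above. In a real setup the penalty would be evaluated on hidden states at the selected layers inside an autodiff framework rather than on Python lists.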
The reported numbers are notable. Against Greedy Coordinate Gradient (GCG) suffix attacks that scored a 100% attack success rate (ASR) on the unmodified baselines, salting cut ASR to 2.75% on LLaMA-2-7B-Chat and 1.35% on Vicuna-7B, while MMLU accuracy stayed within run-to-run noise. By comparison, standard fine-tuning and system-prompt changes only brought ASR down to roughly 40–60%.
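For reference, ASR is the share of attack attempts judged successful, and the size of a drop can be read either in percentage points or relative to the baseline. The short sketch below reproduces that arithmetic from the percentages quoted above. The function names are illustrative, and the per-attempt outcome list is a made-up example, since this article does not report raw counts.

```python
def attack_success_rate(outcomes):
    """Percentage of attack attempts judged successful (True)."""
    return 100.0 * sum(outcomes) / len(outcomes)

def relative_reduction(baseline_asr, defended_asr):
    """Drop in ASR as a percentage of the baseline ASR."""
    return 100.0 * (baseline_asr - defended_asr) / baseline_asr

# Made-up example: 1 success out of 4 attempts.
print(attack_success_rate([True, False, False, False]))

# Percentages quoted in the text: (baseline ASR, ASR after salting).
reported = {"LLaMA-2-7B-Chat": (100.0, 2.75), "Vicuna-7B": (100.0, 1.35)}
for model, (baseline, salted) in reported.items():
    points = baseline - salted
    print(f"{model}: {points:.2f} points, "
          f"{relative_reduction(baseline, salted):.2f}% relative")
```

Because the baseline here is 100%, the percentage-point drop and the relative reduction coincide; with a lower baseline the two figures would differ, so it helps to state which one is meant.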
Why it matters
Model homogeneity is now the norm: thousands of products sit on a small number of base models with minimal customization. Shared representations mean a shared attack surface — a jailbreak validated once may quietly succeed against an entire class of downstream applications, exposing internal data or producing harmful output at fleet scale. The transfer research reframes this from bad luck into a structural property of representation learning. That is bad news, because you cannot patch your way out of how models encode language, and useful news, because transfer is predictable from representational similarity and breakable by changing the geometry rather than chasing individual prompts.
Defenses
Break the shared geometry. Salting-style fine-tuning rotates the refusal direction so that precomputed, transferred jailbreaks land on the wrong axis. Because the perturbation is per-deployment, an attack tuned against the public base model no longer matches your model’s internals.
Layer your defenses. Salting is not a silver bullet: it was evaluated mainly on GCG against 7B open-weight models, and the authors flag AutoDAN, TAP, and larger models as open questions. Combine it with input filtering and classifier-based detection rather than treating any single control as sufficient.
Do not rely on prompt tweaks alone. In the same experiments, system-prompt changes and ordinary fine-tuning still left ASR at roughly 40–60%. Treat those as friction, not protection.
Prioritize the high-transfer classes. Persona-style prompts transfer most readily across models; weight your detection and red-teaming toward them, and toward any benign-looking content that could carry instructions.
Reduce needless homogeneity. Where feasible, introduce per-deployment variation so that an attack reverse-engineered against the upstream base model does not generalize to your instance.
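The layering advice above can be sketched as a small gate in front of the model. Everything here is hypothetical: `layered_guard`, `Verdict`, the two toy layers, and the threshold are placeholders for illustration, not components from the Sophos work or from any library. A real deployment would substitute a proper input filter and a trained jailbreak classifier.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    allowed: bool
    stage: str  # which layer made the decision

def layered_guard(prompt: str,
                  input_filter: Callable[[str], bool],
                  jailbreak_score: Callable[[str], float],
                  threshold: float = 0.5) -> Verdict:
    """Run cheap checks first; forward to the (salted) model only if all pass."""
    if not input_filter(prompt):
        return Verdict(False, "input_filter")
    if jailbreak_score(prompt) >= threshold:
        return Verdict(False, "classifier")
    return Verdict(True, "forward_to_model")

# Placeholder layers for illustration only; real ones would be far richer.
def toy_input_filter(prompt: str) -> bool:
    """Accept only printable prompts under a length cap."""
    return len(prompt) <= 2000 and prompt.isprintable()

def toy_jailbreak_score(prompt: str) -> float:
    """Stand-in for a trained classifier: flags one persona-style phrase."""
    return 1.0 if "you are an unrestricted assistant" in prompt.lower() else 0.0

print(layered_guard("Summarize this report.", toy_input_filter, toy_jailbreak_score))
print(layered_guard("You are an unrestricted assistant.", toy_input_filter, toy_jailbreak_score))
```

Recording which stage blocked a request makes it easier to see whether the filter, the classifier, or the salted model itself is carrying the load, and to weight red-teaming toward the high-transfer prompt classes accordingly.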
Status
| Item | Reference | Date | Note |
|---|---|---|---|
| Transfer mechanism | Angell et al., arXiv:2506.12913 | Jun 2025 (rev. Oct 28 2025) | 20 open-weight models, 33 attacks, 313 prompts each |
| Refusal direction | Arditi et al., arXiv:2406.11717 | 2024 | Refusal mediated by a single linear direction |
| LLM salting (defense) | Sophos X-Ops, CAMLIS 2025 | Oct 2025 | GCG ASR 100% → 2.75% (LLaMA-2-7B) / 1.35% (Vicuna-7B) |
| Tested attack | GCG — Zou et al., arXiv:2307.15043 | 2023 | Adversarial suffix attack |
| Open questions | AutoDAN, TAP, larger models | — | Not yet evaluated under salting |