Behavioral geometry: predicting jailbreak susceptibility across a model population
A May 26, 2026 arXiv paper maps 79 models into a 'behavioral geometry' to predict which are jailbreak-prone using roughly 98% fewer probes, and to transfer defenses between them.
What is this?
Jailbreak susceptibility prediction and mitigation via the behavioral geometry of models is an arXiv paper posted on May 26, 2026 (arXiv:2605.26409) by Hayden Helm, Xiaodong Liu and Weiwei Yang. It tackles a boring-but-expensive operational problem: there are now far too many deployable model-and-prompt configurations to red-team each one from scratch. The authors propose treating a population of models as a geometric space, so that a few measured anchors can predict the safety of everything nearby — and so that a defense tuned for one model can be transferred to its neighbours.
This is not a new attack. It is a defensive measurement framework, and its value is squarely on the blue-team side: cheaper safety evaluation and faster defense rollout across heterogeneous fleets.
How it works
The core idea is to embed models by how they behave, not by their weights or provider. Each model is probed with a battery of adversarial prompts, and its pattern of responses becomes a point in a shared space the authors build using a Data Kernel Perspective Space (DKPS) representation. Two models that respond similarly to the same probes land close together; models that diverge land far apart. The resulting layout is what the paper calls the population’s behavioral geometry.
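The embedding step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes each model's responses have already been turned into fixed-length vectors by some text embedder, measures the distance between two models as the norm of the difference between their stacked response vectors, and lays the models out with classical multidimensional scaling. The function names, array shapes and choice of distance are assumptions made for the example.

```python
import numpy as np


def pairwise_model_distances(responses: np.ndarray) -> np.ndarray:
    """Distance between every pair of models.

    responses: array of shape (n_models, n_probes, dim) holding one
    response embedding per model and probe (an assumed input format).
    """
    n_models = responses.shape[0]
    flat = responses.reshape(n_models, -1)
    diff = flat[:, None, :] - flat[None, :, :]
    return np.linalg.norm(diff, axis=-1)


def classical_mds(dist: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Place items in n_components dimensions so that Euclidean distances
    between the returned points approximate the entries of dist."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals = np.clip(eigvals[order], 0.0, None)  # drop tiny negative values
    return eigvecs[:, order] * np.sqrt(top_vals)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toy_responses = rng.normal(size=(5, 10, 8))  # 5 models, 10 probes, 8-dim vectors
    coords = classical_mds(pairwise_model_distances(toy_responses))
    print(coords.shape)
```

Each row of the returned array is one model's position. Models that answer the probes similarly end up with small pairwise distances and therefore sit close together.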
Once that geometry exists, two practical things follow. First, susceptibility prediction: if a handful of models in a region have already been fully evaluated and are known to be jailbreak-prone, an unevaluated model that falls in the same region can be flagged as likely-susceptible without running the full suite against it. Second, defense transfer: an in-context defense optimized on one model can be applied to nearby models, using proximity in the geometry to choose which “donor” model’s defense to reuse.
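Both uses can be illustrated with nearest-neighbour lookups. The sketch below is one simple way to act on the geometry, not necessarily the method used in the paper. It assumes every model already has coordinates, that a few anchor models carry a measured susceptibility score between 0 and 1, and that a few donor models have a tuned defense. The helper names and the default `k` are hypothetical.

```python
import numpy as np


def distances_to(coords, targets):
    """Distance from every model to each model listed in targets."""
    coords = np.asarray(coords, dtype=float)
    target_coords = coords[np.asarray(targets)]
    return np.linalg.norm(coords[:, None, :] - target_coords[None, :, :], axis=-1)


def predict_susceptibility(coords, anchor_idx, anchor_scores, k=3):
    """Average the measured scores of the k nearest anchors for every model."""
    anchor_scores = np.asarray(anchor_scores, dtype=float)
    k = min(k, len(anchor_scores))
    dist = distances_to(coords, anchor_idx)
    nearest = np.argsort(dist, axis=1)[:, :k]
    return anchor_scores[nearest].mean(axis=1)


def choose_donor(coords, donor_idx):
    """For every model, return the index of the closest donor model."""
    dist = distances_to(coords, donor_idx)
    return np.asarray(donor_idx)[np.argmin(dist, axis=1)]
```

An unevaluated model whose nearest anchors were all jailbreak-prone gets a score near 1 and can be queued for direct testing, and `choose_donor` picks the defense to reuse by proximity rather than by provider.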
The study is sizable. The authors apply the framework to 79 models spanning 24 providers, and separately to 100 system configurations of a single base model (varying the surrounding prompt and settings). Simple methods built on the behavioral geometry reach an AUPRC of 0.94 for detecting susceptible configurations while using roughly 98% fewer probes than a full evaluation. For defense transfer, selecting the donor by behavioral proximity beats naive same-provider assignment by about +2% (p = 0.03) at no extra probe cost, and the authors report that a set of three well-chosen models is enough to cover the whole population. The results are described as robust to hyperparameter choices and to the judge model used for scoring.
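To make the headline metric concrete, the sketch below computes AUPRC in its common average-precision form: rank configurations by predicted susceptibility, then average the precision observed at the rank of each truly susceptible one. The paper may compute the area differently (for example by interpolating the curve), so treat this as an illustration of the metric rather than a reproduction of the reported 0.94. The function name is hypothetical and tied scores are broken by input order.

```python
import numpy as np


def average_precision(labels, scores):
    """Average precision: mean of precision@rank taken at each positive.

    labels: 1 for susceptible, 0 for not susceptible.
    scores: predicted susceptibility, higher means more likely susceptible.
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    if not labels.any():
        raise ValueError("need at least one positive label")
    order = np.argsort(-scores, kind="stable")  # highest score first
    hits = labels[order]
    precision_at_rank = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision_at_rank[hits].mean())
```

A ranking that places every susceptible configuration ahead of every safe one scores 1.0, while a random ranking scores roughly the fraction of configurations that are susceptible, which is the baseline any reported AUPRC should be read against.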
Why it matters
For anyone running more than one model — multiple providers, multiple fine-tunes, or one base model wrapped in many system prompts — safety does not transfer for free. A configuration that looks safe in isolation can become susceptible after a prompt change or a fine-tune, and exhaustively re-testing every variant is impractical. The behavioral geometry reframes that fleet as a structured space rather than a pile of independent unknowns, which is exactly the visibility a security team needs to triage.
The honest caveat is that this is a prediction tool, not a guarantee. A 0.94 AUPRC means strong but imperfect ranking: some susceptible configurations will sit in “safe” neighbourhoods and slip through, and the geometry is only as representative as the probe set and anchor models used to build it. Treat it as a way to prioritize scarce red-team effort, not as a substitute for testing the configurations that actually ship. It complements, rather than replaces, representation-level jailbreak detection and full benchmark runs.
Defenses
The paper is itself a defensive contribution, and it maps onto a concrete program for teams managing model fleets.
Build a population view, not a per-model view. Maintain a shared probe battery and place every deployed configuration in one comparison space, so a new variant is assessed by its neighbours instead of from a blank slate. This is the practical answer to the related finding that jailbreak transfer emerges from shared representations — shared behavior is measurable and can be exploited defensively.
Spend probes where they buy the most information. Use the geometry to pick a small set of anchor models for deep evaluation and predict the rest, then verify the predicted-susceptible and the predicted-safe-but-shipping configurations directly. The 98%-fewer-probes result is a budgeting tool, not permission to skip testing production paths.
Transfer defenses deliberately, by behavioral proximity. When reusing an in-context guardrail or refusal-tuned prompt, choose the donor model by where it sits in the geometry rather than by brand. The paper’s +2% edge over same-provider assignment is small but real, and it argues against the assumption that “same vendor” means “same safety profile”.
Re-measure on every change. A fine-tune or prompt edit can move a configuration into a more susceptible region, so recompute its position after any change. Pair the geometry with a multi-turn jailbreak benchmark as well, since single-turn probes alone under-measure stateful risk.
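Two of the recommendations above, choosing anchors and re-measuring after changes, can be sketched as follows. The summary here does not say how the paper selects anchors, so the example swaps in greedy farthest-point sampling, a generic heuristic for spreading a few picks across a space. It also assumes old and new coordinates live in the same space, and the function names and the tolerance are hypothetical.

```python
import numpy as np


def farthest_point_anchors(coords, n_anchors, start=0):
    """Greedily pick anchors so each new pick is as far as possible
    from all anchors chosen so far (an assumed heuristic)."""
    coords = np.asarray(coords, dtype=float)
    chosen = [int(start)]
    min_dist = np.linalg.norm(coords - coords[start], axis=1)
    while len(chosen) < min(n_anchors, len(coords)):
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        dist_to_new = np.linalg.norm(coords - coords[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return chosen


def moved_configs(old_coords, new_coords, tolerance):
    """Indices of configurations whose position shifted by more than
    tolerance after a change, as candidates for re-evaluation."""
    old_coords = np.asarray(old_coords, dtype=float)
    new_coords = np.asarray(new_coords, dtype=float)
    shift = np.linalg.norm(new_coords - old_coords, axis=1)
    return np.flatnonzero(shift > tolerance)
```

Anchors chosen this way get the full probe suite, and anything `moved_configs` returns after a fine-tune or prompt edit is a candidate for direct re-testing rather than a reused prediction.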
Status
| Item | Reference | Date | Notes |
|---|---|---|---|
| Paper | arXiv:2605.26409v1 | 2026-05-26 | Behavioral-geometry framework (DKPS) |
| Population studied | 79 models / 24 providers | — | Plus 100 configs of one base model |
| Susceptibility detection | AUPRC 0.94 | — | ~98% fewer probes vs full evaluation |
| Defense transfer | +2% over same-provider (p=0.03) | — | 3 models suffice to cover the population |
| Robustness | Stable across hyperparameters and judge | — | Authors’ own report |
The takeaway is methodological rather than alarming: jailbreak susceptibility is structured across a model population, and that structure can be measured cheaply and used to spend red-team budget — and to roll out defenses — where they matter most. It is a triage instrument, and like any triage instrument it is useful precisely because it tells you what not to test exhaustively, while still demanding that you verify what ships.