JAILBREAK MEDIUM

Mathematical encoding jailbreaks: when set theory bypasses LLM safety

An arXiv paper posted on May 5, 2026 shows that re-expressing a harmful prompt as a set-theory or formal-logic problem bypasses safety training on 46–56% of attempts across eight frontier models — but only when a helper LLM does the reformulation, not when mathematical syntax is bolted on top.

2026-05-25 // 7 min
affects: gpt-4o, gpt-5, gpt-5-mini, claude-3.5-sonnet, claude-4, gemini-1.5-pro, llama-3.1, deepseek-v3

What is this?

On May 5, 2026, Haoyu Zhang, Mohammad Zandsalimy and Shanu Sushmita posted Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis (arXiv:2605.03441). The paper systematises an attack family that had been circulating since the MathPrompt preprint of Bethany et al. (arXiv:2409.11445, September 2024) and the Logic Jailbreak paper of May 2025 (arXiv:2505.13527): take a harmful natural-language prompt, ask a helper LLM to rewrite it as a coherent problem in set theory, abstract algebra, formal logic or quantum-mechanical notation, and submit the mathematised version to a target model.

Across eight target models and two established jailbreak benchmarks, the May 2026 paper measures an average attack success rate of 46% to 56%. The original MathPrompt result on thirteen 2024-era models was even higher at 73.6%. The new contribution is twofold: a novel Formal Logic encoding that matches or exceeds the older Set Theory encoding on frontier models, and a systematic ablation that isolates why the attack works.

How it works

The pipeline has three components: an attacker LLM, a fixed encoding scheme, and the target LLM. The attacker is prompted to translate the harmful intent into a mathematical problem statement that preserves the operational structure of the request while wrapping it in symbolic notation. The target then answers the math problem — which, decoded, is the answer to the original harmful prompt.

# Conceptual structure of the attack — illustrative, not an exploit payload.
# The May 2026 paper releases methodology and aggregate numbers, no transcripts.

harmful_prompt    = "[REDACTED — drawn from AdvBench / HarmBench]"
encoding          = "formal_logic"   # or "set_theory", "abstract_algebra"
encoder_prompt    = ENCODING_TEMPLATE[encoding]
math_problem      = attacker_llm.reformulate(harmful_prompt, encoder_prompt)
# math_problem is a coherent symbolic problem whose solution maps back
# 1-to-1 onto the harmful answer. Its surface text sits far from the original request.

answer            = target_llm.solve(math_problem)
harmful_answer    = decode(answer)   # by construction, by the attacker

The systematic ablation is the part defenders should read. The authors compare three reformulation modes: (1) a helper LLM that deeply rewrites the prompt as a genuine mathematical problem, (2) rule-based wrappers that add mathematical notation without changing the underlying semantics, and (3) the untouched harmful prompt. Mode 1 succeeds at the 46–56% rate. Mode 2 performs no better than mode 3. The conclusion: it is not the symbols that fool the model, it is the semantic distance between the surface text and the harmful intent, induced by a competent rewriter.
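To make the comparison concrete: the attack success rate (ASR) for a mode is the number of attempts judged successful divided by the number of attempts in that mode. The following is a minimal sketch of that bookkeeping. The function name, the mode labels and every outcome record are hypothetical, chosen only to illustrate the shape of the result; the paper publishes aggregate numbers, not per-attempt data.

```python
from collections import defaultdict

def attack_success_rate(records):
    """Return {mode: fraction of attempts judged successful}.

    `records` is an iterable of (mode, succeeded) pairs, where `succeeded`
    is the verdict of whatever judge the evaluation uses.
    """
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for mode, succeeded in records:
        attempts[mode] += 1
        successes[mode] += int(succeeded)
    return {mode: successes[mode] / attempts[mode] for mode in attempts}

# Made-up outcomes (NOT the paper's data): 10 attempts per mode.
records = (
    [("llm_rewrite", True)] * 5 + [("llm_rewrite", False)] * 5
    + [("rule_wrapper", True)] * 1 + [("rule_wrapper", False)] * 9
    + [("baseline", True)] * 1 + [("baseline", False)] * 9
)

rates = attack_success_rate(records)
for mode, rate in rates.items():
    print(f"{mode}: {rate:.0%}")
```

In this toy data the rule-based wrapper is indistinguishable from the untouched baseline, which is the pattern the ablation reports: the rewrite, not the notation, carries the attack.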

The embedding analysis in MathPrompt corroborates this. Encoded prompts sit far from their natural-language counterparts in the model’s representation space, which is exactly where safety classifiers — trained on natural-language harmful examples — lose discrimination.
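A defender can probe the same effect on their own stack by measuring how far a rewritten prompt drifts from the original in embedding space. A minimal sketch of the distance computation, assuming you already have embedding vectors from some encoder; the three-dimensional vectors below are toy values for illustration, not measurements from either paper.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity: 0 for same direction, 1 for orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)

# Toy vectors standing in for real embeddings (hypothetical values).
original_prompt = [1.0, 0.0, 0.0]
light_paraphrase = [2.0, 0.0, 0.0]   # same direction as the original
deep_rewrite = [0.0, 1.0, 0.0]       # orthogonal to the original

near = cosine_distance(original_prompt, light_paraphrase)
far = cosine_distance(original_prompt, deep_rewrite)
print(f"paraphrase: {near:.2f}, deep rewrite: {far:.2f}")
```

With real embeddings the numbers will be less clean, but a classifier trained only on examples near `original_prompt` has little to go on for inputs that land where `deep_rewrite` does.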

Why it matters

Three points are worth holding together.

First, alignment training has a representational blind spot. Safety post-training generalises along the manifold of natural-language harm; it does not generalise along arbitrary semantic-preserving transformations. The mathematical-encoding family is one instance of this; cipher attacks, low-resource-language attacks and persona-based attacks are others. The May 2026 paper is the cleanest measurement of the effect on a frontier cohort that includes GPT-5 and GPT-5-Mini, which are described as substantially more robust than older models — but still vulnerable.

Second, the attack scales with attacker capability, not defender weakness alone. The reformulation step requires the helper LLM to produce a mathematically coherent rewrite. As open-weight models get better at symbolic reasoning, the rewriter step gets cheaper and more reliable. This mirrors the trend documented in the Large Reasoning Models as Autonomous Jailbreak Agents paper (Hagendorff et al., Nature Communications 2026): improving reasoning capability improves attacker capability against aligned models.

Third, the attack is not a payload, it is a transformation. There is no canonical string to filter on. Two encodings of the same harmful prompt share no surface tokens. This is also why publishing the principle, with no payloads, is the responsible choice — defenders need the conceptual lever, not the inputs.

Defenses

The paper closes with a defensive direction the authors describe as “reasoning about mathematical structure rather than surface semantics”. Concretely, for teams shipping LLM-backed products:

Filter on outputs, not just inputs. Output-side classification — checked against the user’s stated task and applied after generation — is robust to encoded inputs in a way that input-side classification cannot be. This is consistent with the result of Evaluation of Prompt Injection Defenses in Large Language Models (arXiv:2604.23887, updated May 2026): output filtering achieved zero leaks across 15,000 attacks, while every model-self-defense configuration eventually broke.
Add a decode step before serving. If your application surface only ever expects natural-language answers, parse the model’s response and reject outputs that contain expanded symbolic content, decoded ciphers, or step-by-step formal-logic derivations of operational instructions.
Use a separate, simpler classifier on the rendered intent. Rather than ask the same model to judge its own output, route the (input, output) pair through a small dedicated harm classifier (Llama Guard 3, ShieldGemma, Granite Guardian) that was trained on natural language. For the same reason, decode before classifying: a classifier fed still-encoded text has the same blind spot as the input-side filters.
Constrain tool-use scopes. If the LLM is wired to tools, a successful mathematical jailbreak is more damaging, because the answer can be executed rather than merely read. Per-tool allowlists and the Agents Rule of Two pattern reduce the blast radius.
Track this attack family in evaluations. Add a math-encoded variant of your refusal benchmarks. Re-run after every system-prompt change. The May 2026 paper shows newer models are more robust — but only on the encodings tested.
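The first three items can be combined into a single output-side gate. What follows is a rough sketch under stated assumptions, not a production filter: `symbolic_density`, `gate_output`, the symbol list and the 0.15 threshold are all hypothetical, and `keyword_stub` is a placeholder for a real harm classifier such as the ones named above (their actual interfaces are not shown here).

```python
import re
from dataclasses import dataclass
from typing import Callable

# Formal-notation symbols and LaTeX-style commands (illustrative, not exhaustive).
SYMBOLIC = re.compile(r"[∀∃∈∉⊂⊆∪∩∧∨¬→⇒⇔∘⊕⊗≡]|\\[a-zA-Z]+")

def symbolic_density(text: str) -> float:
    """Fraction of whitespace-separated tokens that contain formal notation."""
    tokens = text.split()
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if SYMBOLIC.search(token))
    return hits / len(tokens)

@dataclass
class Verdict:
    allowed: bool
    reason: str

def gate_output(
    user_input: str,
    model_output: str,
    harm_classifier: Callable[[str, str], bool],
    max_density: float = 0.15,  # arbitrary threshold; tune on your own traffic
) -> Verdict:
    """Check a generated response before it is served to the user."""
    # 1. The application expects prose, so heavy notation is itself suspicious.
    if symbolic_density(model_output) > max_density:
        return Verdict(False, "unexpected symbolic content")
    # 2. A separate classifier judges the (input, output) pair.
    if harm_classifier(user_input, model_output):
        return Verdict(False, "flagged by harm classifier")
    return Verdict(True, "ok")

def keyword_stub(user_input: str, model_output: str) -> bool:
    """Placeholder classifier: flags one literal marker string."""
    return "forbidden-topic" in model_output.lower()

print(gate_output("q", "Paris is the capital of France.", keyword_stub))
print(gate_output("q", "Let A ⊂ B and ∀ x ∈ A , f(x) → y", keyword_stub))
```

The density check only makes sense for products that never legitimately return mathematics; a tutoring or research assistant would need the classifier step alone, run on a decoded rendering of the output.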

Status

Item	Reference	Date	Notes
Primary paper — arXiv preprint	`arXiv:2605.03441`	2026-05-05	8 target models, 2 benchmarks, 46–56% ASR
Predecessor — MathPrompt	`arXiv:2409.11445` (Bethany et al.)	2024-09-17	13 models, 73.6% average ASR
Predecessor — Logic Jailbreak	`arXiv:2505.13527`	2025-05	Formal logical expressions as encoding
Independent reference — Promptfoo LM Security DB	promptfoo.dev	2026	Catalogued as “Symbolic Math Jailbreak”
Defensive complement — output filtering	`arXiv:2604.23887`	2026-05	Zero-leak result across 15,000 attacks

The attack class is not new; the May 2026 paper is a measurement update against current frontier models and a clean ablation of why the family works. The actionable signal for defenders is the same direction the other May 2026 results have been pointing: the boundary that survives an adaptive attacker lives outside the model, in output filtering and action-layer constraints.