JAILBREAK

24 hacks.

Information overloading: dense image-text prompts jailbreak vision LLMs

A July 2026 NUS paper jailbreaks vision-language models by overloading them with recursive image-typography layouts — 84% success on Gemini and GPT-4.1-mini, with prompts that transfer across model families.

2026-07-17//6 min

JAILBREAK MEDIUM NEW

Long-context jailbreaks: how goal positioning weakens LLM safety

A CMU study shows that padding a harmful request with benign filler and placing the goal early in a long context reliably degrades refusals across LLaMA, Qwen, Mistral, and Gemini.

2026-07-15//6 min

JAILBREAK MEDIUM NEW

Workflow-level jailbreaks: coding agents write what they refuse in chat

A July 2026 Alan Turing Institute study shows IDE coding agents refuse harmful prompts in chat but author the same content inside a metric-driven build workflow — 816/816 unsafe outputs across four Claude and Gemini backends.

2026-07-13//7 min

JAILBREAK MEDIUM NEW

Why diffusion LLMs resist jailbreaks — until context nesting breaks them

Diffusion language models correct many jailbreaks mid-generation, giving them a safety edge over autoregressive models. But 2026 research shows context-nesting attacks slip right past that defense.

2026-07-09//7 min

JAILBREAK CRITICAL NEW

How poetry and folktale framing jailbreak frontier LLMs

Two 2025–2026 studies show that rewriting a harmful request as verse or a Propp-style folktale bypasses safety training on nearly every frontier model — a structural jailbreak class, not a one-off trick.

2026-07-09//6 min

JAILBREAK MEDIUM NEW

Harmless questions, harmful answer: knowledge-decomposition guardrail bypass

An ICML 2026 paper shows a jailbreak that never asks anything harmful — it splits a forbidden goal into benign sub-queries, then reassembles the answer, reportedly beating commercial guardrails over 95% of the time.

2026-07-07//6 min

JAILBREAK MEDIUM NEW

Persona Attack: how accumulated conversation memory erodes safety alignment

A June 2026 paper shows that jailbreaks spread across many turns — building a persona in the model's memory — can gradually outweigh safety training, reaching high success once enough context accumulates.

2026-07-06//6 min

JAILBREAK CRITICAL NEW

Chain-of-Thought Hijacking: long reasoning traces dilute a model's refusal signal

A black-box jailbreak buries a harmful request under thousands of tokens of benign reasoning. As the trace grows, the internal refusal signal fades — reported at up to 100% success on frontier reasoning models.

2026-07-05//6 min

JAILBREAK MEDIUM NEW

The Residual Jailbreak Surface: adaptive attacks still break frontier models

A June 2026 red-team study of two frontier models finds that static obfuscation is nearly dead, but adaptive iterative search still elicits confirmed harmful completions in every harm category, and it succeeds within the first one or two steps.

2026-07-05//6 min

JAILBREAK MEDIUM NEW

Simulated moderation traces: jailbreaking tool-enabled LLMs

A July 2026 paper shows attackers can jailbreak function-calling LLMs by faking a safety-audit workflow across tool turns, evidence that prompt-level filtering alone is not enough.

2026-07-04//6 min

JAILBREAK MEDIUM NEW

Splitting a harmful task into harmless steps slips past agent guardrails

A late-May 2026 red-teaming framework decomposes a malicious goal into individually benign-looking subtasks, reaching up to a 100% bypass rate on agents built with frontier models — and current defenses only partly contain it.

2026-07-04//7 min

JAILBREAK MEDIUM NEW

Fanfiction register: when a whole writing style becomes the jailbreak

A June 2026 arXiv paper shows that safety training under-covers an entire register of human writing — fanfiction voice — lifting mean attack success from 0.28 to 0.73 with no attacker model and no per-target tuning.

2026-07-03//6 min

JAILBREAK MEDIUM NEW

Cognitive overload: how low image resolution jailbreaks multimodal LLMs

A May 2026 paper (Findings of ACL 2026) shows that lowering the resolution of text rendered as an image pushes frontier MLLMs into an 'Attack Comfort Zone' where safety alignment collapses while OCR stays accurate.

2026-06-21//6 min

JAILBREAK MEDIUM NEW

CTF-framing jailbreaks: the prompt leaks into the attack

Sysdig (June 15, 2026) caught operators jailbreaking their own coding assistants by framing exploit requests as CTF or CVE-hunting. The framing bleeds into User-Agents, passwords, and IAM logs, leaving defenders a cheap fingerprint to detect.

2026-06-21//7 min

JAILBREAK MEDIUM NEW

RL jailbreaking: reward shape and episode length drive the attack

A June 2026 study deconstructs reinforcement-learning jailbreaking and finds the attacker's environment design — dense rewards and long episodes — matters more than the RL algorithm.

2026-06-20//6 min

JAILBREAK MEDIUM NEW

UniAttack: one automated jailbreak that targets layered LLM defenses

A June 2026 preprint builds an automated, strategy-mixing red-teaming framework and runs it against models with different stacked defenses — finding that layering guardrails does not guarantee robustness.

2026-06-20//5 min

JAILBREAK MEDIUM NEW

Adaptive jailbreaks keep breaking LLM defenses: the evaluation gap

A June 2026 framework, UniAttack, composes reusable attack features into one-shot jailbreaks that transfer across models and defenses — a reminder that any defense tested only against static attacks gives false assurance.

2026-06-18//6 min

JAILBREAK MEDIUM

IICL: pattern completion beats safety alignment with 10 examples

An April 2026 arXiv paper turns a model's own in-context learning against it: about ten abstract-operator examples make GPT-5.4 complete a harmful pattern its content filters never flag.

2026-06-17//6 min

JAILBREAK MEDIUM NEW

Para-jailbreaking: when 'safe completions' leak harm in the alternatives

An April 27, 2026 arXiv paper names a new failure mode of output-centric safety: a model can correctly refuse the direct question yet leak harmful content inside the 'safe alternative' it offers instead.

2026-06-16//6 min

JAILBREAK MEDIUM NEW

Multi-clip video jailbreaks: why video inputs break multimodal LLM safety

A June 2026 ACL paper shows the video channel is a weaker safety boundary than images: attack success climbs as a video is split into more diverse short clips.

2026-06-14//6 min

JAILBREAK MEDIUM NEW

CodeSpear: when grammar-constrained decoding becomes a jailbreak surface

A June 10, 2026 arXiv paper shows that the reliability feature forcing LLM code output to be syntactically valid can itself be turned into a jailbreak. Applying a benign code grammar can bypass refusals; the authors' CodeShield defense answers with honeypot code.

2026-06-12//6 min

JAILBREAK MEDIUM NEW

Sockpuppeting: a one-line prefill that jailbreaks 11 production LLMs

A single line of code injected as the last assistant message coaxes 7 of 10 major models into harmful completions. The fix is not in the model; it is API-side message-order validation.

2026-05-28//7 min

JAILBREAK MEDIUM

Mathematical encoding jailbreaks: when set theory bypasses LLM safety

An arXiv paper posted on May 5, 2026 shows that re-expressing a harmful prompt as a set-theory or formal-logic problem bypasses safety training on 46–56% of attempts across eight frontier models — but only when a helper LLM does the reformulation, not when mathematical syntax is bolted on top.

2026-05-25//7 min

JAILBREAK CRITICAL

Many-shot jailbreaking: 256 examples to bypass any alignment

Anthropic researchers showed that stuffing the context window with 256 fake Q&A examples reliably bypasses safety training. A bigger context window means a bigger attack surface.

2026-05-15//6 min