JAILBREAK CRITICAL

Many-shot jailbreaking: 256 examples to bypass safety training

Anthropic researchers showed that stuffing the context window with up to 256 fake Q&A examples can override a model's safety training. Bigger context = bigger attack surface.

2026-05-15 // 6 min // affects: claude-3, gpt-4-turbo, gemini-1.5

The trick

Take a model with a 200K+ token context window. Stuff it with 256 fake examples where an “assistant” cheerfully answers harmful questions. Then ask your real harmful question.

The model, primed by the in-context pattern, complies.
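
Why does the context window size matter so much? A rough token budget makes it concrete. The sketch below estimates how many scripted examples fit under a given context limit, which is the number a defender cares about when choosing a cap. The 150-tokens-per-example average and the 500-token reserve are assumptions for illustration, not figures from the paper, and `shots_that_fit` is a hypothetical helper name.

```python
# Back-of-the-envelope budget: how many scripted examples fit in a window?
# TOKENS_PER_EXAMPLE is an assumed average, not a number from the paper.
TOKENS_PER_EXAMPLE = 150


def shots_that_fit(context_window: int, reserved: int = 500) -> int:
    """Examples that fit, leaving `reserved` tokens for the final
    question and the model's reply (both values are assumptions)."""
    usable = max(context_window - reserved, 0)
    return usable // TOKENS_PER_EXAMPLE


for window in (4_096, 32_768, 200_000):
    print(f"{window:>7} tokens -> about {shots_that_fit(window)} examples")
```

Under these assumptions a 4K window holds a couple of dozen examples, while a 200K window holds well over a thousand. Capping context length lowers the ceiling on how many shots an attacker can supply.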

Why it works

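LLMs do in-context learning: they pick up patterns from examples in the prompt, and the more examples, the stronger the pattern. Safety training shapes the model’s default behavior through fine-tuning; it is not a separate filter applied to outputs. So when 256 examples in the immediate context show the model “this is how I answer here”, the in-context pattern can outweigh the trained default.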

The data

Anthropic’s paper documents the attack success rate climbing from <5% with one example to ~80% at 256 examples, across multiple harm categories.

Defenses

Cap effective context length for safety-critical applications
Re-apply safety classification to outputs even when context is long
Train models to detect in-context pattern manipulation
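
The first two defenses can be sketched as a thin wrapper around the model call. This is a minimal illustration, not a production filter: the role-marker regex, both thresholds, and the `generate` / `is_unsafe` callables are all hypothetical stand-ins for whatever model client and safety classifier you actually run.

```python
import re
from typing import Callable, Optional

# Assumed role labels; real prompts may use other markers.
TURN_MARKER = re.compile(
    r"^\s*(?:user|human|assistant|ai|q|a)\s*:",
    re.IGNORECASE | re.MULTILINE,
)

MAX_EMBEDDED_TURNS = 16    # assumed threshold, tune on your own traffic
MAX_PROMPT_CHARS = 20_000  # assumed cap, standing in for a token limit


def count_embedded_turns(prompt: str) -> int:
    """Count lines inside one message that look like dialogue turns."""
    return len(TURN_MARKER.findall(prompt))


def screen_prompt(prompt: str) -> Optional[str]:
    """Return a rejection reason, or None if the prompt passes."""
    if len(prompt) > MAX_PROMPT_CHARS:
        return "prompt exceeds length cap"
    if count_embedded_turns(prompt) > MAX_EMBEDDED_TURNS:
        return "prompt embeds a long scripted dialogue"
    return None


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    is_unsafe: Callable[[str], bool],
) -> str:
    """Screen the prompt, call the model, then classify the reply."""
    reason = screen_prompt(prompt)
    if reason is not None:
        return f"[blocked: {reason}]"
    reply = generate(prompt)
    # The output check runs on every reply, however long the prompt was.
    if is_unsafe(reply):
        return "[blocked: output failed safety check]"
    return reply
```

A heuristic like this is easy to evade (an attacker can rename the role labels), which is why the output check runs unconditionally instead of only when the prompt looks suspicious.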

The root cause is in-context learning itself, the same capability that makes long context windows useful, so it cannot simply be switched off. The defenses above are mitigations, not fixes.

Sources

→ https://www.anthropic.com/research/many-shot-jailbreaking