Many-shot jailbreaking: how 256 in-context examples can override safety training
Anthropic researchers showed that filling the context window with as many as 256 fake Q&A examples can override a model's safety training. A bigger context window means a bigger attack surface.
The trick
Take a model with a 200K+ token context window. Stuff it with 256 fake examples where an “assistant” cheerfully answers harmful questions. Then ask your real harmful question.
The model, primed by the in-context pattern, often complies.
Why it works
LLMs do in-context learning: the more examples in the prompt, the stronger the pattern the model picks up. Safety training shapes the model’s default behavior through fine-tuning, but it does not switch off in-context learning. When 256 examples in the immediate context show the model “this is how I answer here”, the new pattern can outweigh the trained refusal.
The data
Anthropic’s paper reports that attack success rises steadily with the number of examples, following a power-law trend: from under 5% with one example to roughly 80% at 256 examples, with exact figures varying by model and harm category.
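To check whether your own measurements follow a similar trend, one option is a log-log fit. The sketch below assumes you track a loss-like metric that shrinks as shots increase (for example, the negative log-likelihood of the harmful response) and that it behaves like C * n^(-alpha). The function name and the data are hypothetical: the points are synthetic, generated from a known exponent, and are not taken from the paper.

```python
import math

def fit_power_law(shots, metric):
    """Least-squares fit of metric ≈ C * shots**(-alpha) in log-log space.

    Hypothetical helper for illustration; returns (C, alpha).
    """
    xs = [math.log(n) for n in shots]
    ys = [math.log(v) for v in metric]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # slope of log(metric) vs log(shots)
    intercept = mean_y - slope * mean_x    # log(C)
    return math.exp(intercept), -slope

# Synthetic data generated from C = 3.0, alpha = 0.4 (NOT real measurements).
shots = [1, 2, 4, 8, 16, 32, 64, 128, 256]
metric = [3.0 * n ** -0.4 for n in shots]

C, alpha = fit_power_law(shots, metric)
print(f"C = {C:.3f}, alpha = {alpha:.3f}")
```

A straight line in log-log space means each doubling of the shot count buys the attacker a roughly constant multiplicative gain, which is why raising the refusal baseline alone does not make the trend go away.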
Defenses
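- Cap effective context length for safety-critical applications, accepting the loss of long-context utility
- Run safety classification on outputs regardless of context length, and screen prompts before they reach the model
- Fine-tune models to recognize and refuse in-context pattern manipulation; Anthropic reports that this alone delays the jailbreak rather than preventing it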
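As a rough illustration of screening input before it reaches the model, the sketch below counts dialogue-style role markers embedded in a single prompt and flags prompts that contain suspiciously many. Everything here is an assumption made for illustration: the marker list, the function names, and the threshold of 16 are hypothetical, and a production filter would more likely rely on a trained classifier than a regular expression.

```python
import re

# Hypothetical role markers; real attacks can use arbitrary formatting,
# so this pattern is illustrative rather than exhaustive.
ROLE_MARKER = re.compile(
    r"^[ \t]*(?:user|human|assistant|ai)[ \t]*:",
    re.IGNORECASE | re.MULTILINE,
)

def count_embedded_turns(prompt: str) -> int:
    """Count lines that look like the start of a dialogue turn."""
    return len(ROLE_MARKER.findall(prompt))

def screen_prompt(prompt: str, max_turns: int = 16) -> bool:
    """Return True if the prompt should be flagged for extra review.

    max_turns is an arbitrary placeholder threshold, not a recommended value.
    """
    return count_embedded_turns(prompt) > max_turns

# Benign placeholder content only; no harmful text is needed to test the filter.
benign = "Summarize this article about context windows."
stuffed = "\n".join(
    f"User: question {i}\nAssistant: answer {i}" for i in range(256)
)

print(count_embedded_turns(stuffed), screen_prompt(stuffed), screen_prompt(benign))
```

A heuristic like this is easy to evade by changing the turn formatting, so it is best treated as one cheap signal feeding a broader classifier, not as a defense on its own.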
The vulnerability stems from in-context learning itself, the same capability that makes long context windows useful, rather than from a bug that can be patched out. Current defenses are mitigations, not fixes.