RESEARCH
Measuring LLM exploit capability: ExploitBench, ExploitGym and the SCONE-bench refresh
On May 22, 2026 Anthropic published Mythos Preview results on three new exploitation benchmarks. The numbers — and the way the benchmarks decompose the exploit chain — change how defenders should think about frontier offensive capability.
Proprietary Problems: Cisco's 15-model paired-regime study shows single-turn safety scores miss most multi-turn risk
A May 27, 2026 Cisco study of 15 flagship closed models from OpenAI, Anthropic, Google, Amazon and xAI records multi-turn attack success rates of 7.89% to 88.30% — and cross-regime gaps up to 55 percentage points over single-turn baselines.
The agent-human security gap: what production ships, what papers study
A May 23, 2026 UCLA paper audits 59 academic studies, 21 production agent systems and 26 security plugins — and finds that the defenses researchers favor have zero production deployment.
The Autonomy Tax: how defense training breaks LLM agents
A March 19, 2026 USC paper measures the cost of prompt-injection-defense training on agent competence: defended models time out on 99% of tasks, versus 13% for undefended baselines.
Poisoning the Watchtower: when SOC copilots read attacker-controlled logs
A May 23, 2026 paper formalises log-substrate prompt injection — adversarial content in log fields steering LLM-based SOC assistants. Best defense leaves 11.8% average injection success.
MultiBreak: 10,389 multi-turn prompts expose how conversational jailbreaks slip past LLM safety
A May 3, 2026 ICML paper releases the largest, most diverse multi-turn jailbreak benchmark to date. It records attack-success-rate gaps of up to 54 points over the previous state of the art on DeepSeek-R1-7B and 34.6 on GPT-4.1-mini — and quantifies how alignment that holds in single turns collapses across follow-ups.
Teaching Claude Why: how Anthropic drove agentic misalignment to zero
On May 8, 2026, Anthropic's Alignment Science team published a case study showing that teaching Claude to explain its ethical reasoning — not just demonstrate it — cut agentic misalignment from 96% to under 1%.
Contextual integrity: why prompt-injection defenses keep failing
A May 2026 paper by Abdelnabi and Bagdasarian recasts prompt injection through Contextual Integrity and shows that data-instruction separation is a category mistake.
When the attacker is another LLM: large reasoning models as autonomous jailbreakers
A Nature Communications paper published in May 2026 shows four reasoning models (DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini and Qwen3 235B) jailbreaking nine target LLMs with a 97.14% overall success rate, armed with nothing but a single system prompt.
Sleeper agents: hidden backdoors that survive safety training
Anthropic demonstrated that models trained with hidden trigger phrases retain backdoor behavior even after standard RLHF safety training. The implications for open-weight LLMs are significant.