system: OPERATIONAL
> cat /hacks/*.md | wc -l

All hacks (75)

Open database of LLM attacks, jailbreaks, and defenses. Updated daily.

AGENTS CRITICAL NEW

SymJack: one approved file copy becomes RCE in six AI coding agents

Adversa AI disclosed on May 26, 2026 a symlink-hijack pattern that turns a single benign-looking shell copy into a config overwrite and host RCE across Claude Code, Cursor, Gemini, Antigravity, Copilot, Grok Build and Codex CLIs.

2026-05-30//6 min
SUPPLY CHAIN MEDIUM NEW

Slopsquatting in 2026: 127 package names that all five frontier LLMs hallucinate

A May 16, 2026 arXiv replication of the USENIX Security '25 slopsquatting study finds hallucination rates are down across frontier models — but identifies 127 phantom packages that every tested model invents identically, a model-agnostic supply-chain attack surface.

2026-05-29//6 min
AGENTS MEDIUM NEW

Blindfold: action-level jailbreaks bypass semantic defenses on embodied LLMs

A SenSys '26 paper (May 11–14, 2026) introduces Blindfold, an automated framework that jailbreaks embodied LLMs by decomposing harmful goals into individually benign actions — up to 53% higher attack success than semantic-level baselines on a real 6DoF robotic arm.

2026-05-29//6 min
INFRASTRUCTURE CRITICAL NEW

MCPwn (CVE-2026-33032): nginx-ui MCP endpoint hands over the web server

An unauthenticated MCP endpoint in nginx-ui ≤ 2.3.3 lets any network attacker rewrite nginx configs and restart the service. CVSS 9.8, publicly disclosed on April 15, 2026, exploited in the wild within hours of the patch.

2026-05-29//6 min
RESEARCH MEDIUM NEW

Measuring LLM exploit capability: ExploitBench, ExploitGym and the SCONE-bench refresh

On May 22, 2026 Anthropic published Mythos Preview results on three new exploitation benchmarks. The numbers — and the way the benchmarks decompose the exploit chain — change how defenders should think about frontier offensive capability.

2026-05-29//7 min
RESEARCH MEDIUM NEW

Proprietary Problems: Cisco's 15-model paired-regime study shows single-turn safety scores miss most multi-turn risk

A May 27, 2026 Cisco study of 15 flagship closed models from OpenAI, Anthropic, Google, Amazon and xAI records multi-turn attack success rates of 7.89% to 88.30% — and cross-regime gaps up to 55 percentage points over single-turn baselines.

2026-05-29//7 min
DEFENSE MEDIUM NEW

One million exposed AI services: what the Intruder scan actually found

On May 5, 2026, Intruder published the results of an internet-wide scan that mapped 1 million exposed AI services across 2 million hosts. The recurring failure is not exotic — it is permissive defaults.

2026-05-29//7 min
RESEARCH MEDIUM NEW

The agent-human security gap: what production ships, what papers study

A May 23, 2026 UCLA paper audits 59 academic studies, 21 production agent systems and 26 security plugins — and finds that the defenses researchers favor have zero production deployment.

2026-05-29//6 min
RESEARCH MEDIUM NEW

The Autonomy Tax: how defense training breaks LLM agents

A March 19, 2026 USC paper measures the cost of prompt-injection-defense training on agent competence — defended models time out on 99% of tasks, vs 13% for undefended baselines.

2026-05-29//6 min
DEFENSE MEDIUM NEW

MCP needs a trust handshake: attested tool-server admission

A May 22, 2026 arXiv paper proposes mcp-attested — a backward-compatible MCP extension that gates tool dispatch on signed clearance, deny-by-default allowlists, and tamper-evident audit logs.

2026-05-29//6 min
DEFENSE MEDIUM NEW

WARD: a co-evolved guard model that holds up against adaptive prompt injection on web agents

A May 14, 2026 NUS paper proposes WARD — a guard model trained against a memory-driven adversarial attacker — and reports near-perfect out-of-distribution recall on web-agent prompt injection.

2026-05-29//7 min
AGENTS MEDIUM NEW

MemMorph: hijacking tool selection in LLM agents through fluent memory poisoning

A May 24, 2026 arXiv paper from NTU Singapore shows three plausible-looking memory entries can steer an agent toward an attacker-chosen tool with 85.9% success — and survive three off-the-shelf defenses.

2026-05-29//6 min
ADVERSARIAL MEDIUM NEW

SilentRetrieval: fluent RAG corpus poisoning that slips past perplexity filters

A May 27, 2026 arXiv preprint introduces a two-stage attack that hides goal-hijacking triggers inside fluent documents, reaching 57% LLM-attack success on Natural Questions and MS MARCO with one poisoned record per query.

2026-05-29//6 min
GOVERNANCE MEDIUM

CISA + Five Eyes publish the first joint guidance on agentic-AI adoption

On May 1, 2026, CISA, NSA and the Five Eyes cyber agencies released 'Careful Adoption of Agentic AI Services' — a 5-risk taxonomy and a deployment playbook that critical-infrastructure operators are now expected to fold into their existing cybersecurity frameworks.

2026-05-28//7 min
AGENTS CRITICAL NEW

Microsoft Copilot Cowork: poisoned skills exfiltrate M365 files with no approval

PromptArmor's May 26, 2026 disclosure shows that a five-line prompt injection inside a Copilot Cowork skill file can leak SharePoint and OneDrive documents through auto-approved Teams messages — no patch closes the design.

2026-05-28//7 min
MULTIMODAL MEDIUM

CrossMPI: image-only prompt injection steers what VLMs read and see

A May 15, 2026 Xidian University arXiv paper introduces CrossMPI: imperceptible image perturbations that change how vision-language models interpret both the image and the user's text prompt, with 66% average success across five LVLMs.

2026-05-28//6 min
INDIRECT INJECTION MEDIUM NEW

IterInject: when an LLM optimiser writes its own indirect prompt injections

A May 23, 2026 paper closes the loop between payload, diagnoser and LLM optimiser — lifting indirect-injection ASR from near-zero to 33–90% on InjecAgent and compromising 5 of 9 Claude Code targets.

2026-05-28//6 min
GOVERNANCE MEDIUM NEW

NSA AISC publishes MCP security design guidance for production AI

On May 20, 2026, NSA's Artificial Intelligence Security Center released a 15-page Cybersecurity Information Sheet on Model Context Protocol — eight classes of weakness, five real-world incidents, nine defensive recommendations.

2026-05-28//8 min
RESEARCH MEDIUM

Poisoning the Watchtower: when SOC copilots read attacker-controlled logs

A May 23, 2026 paper formalises log-substrate prompt injection — adversarial content in log fields steering LLM-based SOC assistants. Best defense leaves 11.8% average injection success.

2026-05-28//7 min
SUPPLY CHAIN MEDIUM

pgAdmin 4 ships an LLM panel and a classic LFI+SSRF arrives with it (CVE-2026-7817)

pgAdmin 4 9.15 patches an authenticated LFI and SSRF in its new LLM API configuration endpoints. The bug class is decades old; the surface is brand new.

2026-05-28//6 min
AGENTS MEDIUM NEW

Temporal memory contamination: longitudinal safety drift in memory-equipped LLM agents

Three arXiv papers from April and May 2026 converge on a failure mode complementary to memory poisoning — memory-equipped agents drift unsafe as benign context accumulates, with compressed summaries acting as a laundering channel.

2026-05-28//7 min
GOVERNANCE MEDIUM NEW

The pressure: open-source security teams under the AI-assisted vulnerability flood

On May 26, 2026, curl's Daniel Stenberg published 'The pressure' — more than one credible security report per day, twelve confirmed CVEs in half a release cycle, and a pattern other maintainers are now reporting in parallel.

2026-05-28//7 min
AGENTS MEDIUM NEW

The agent harness is your real privilege boundary — and most teams draw it in the wrong place

A May 26, 2026 Pillar Security write-up argues the harness — Claude Code, Cursor, Codex — holds the secrets, tools and hooks an agent never sees. Recent harness bugs and CVE-2026-22708 make the case concrete.

2026-05-28//7 min
JAILBREAK MEDIUM NEW

Sockpuppeting: a one-line prefill that jailbreaks 11 production LLMs

A line of code injected as the last assistant message coaxes 7 of 10 major models into harmful completions. The fix is not at the model — it is API-side message-order validation.

2026-05-28//7 min
INDIRECT INJECTION MEDIUM NEW

GrafanaGhost: indirect prompt injection chained with a URL-parse bug to exfiltrate dashboard data

Noma Security's April 7, 2026 disclosure shows how three modest defects — a stored injection point, a startsWith('/') URL check, and a one-word guardrail bypass — combine into a silent exfiltration path through Grafana's AI assistant.

2026-05-28//6 min
AGENTS MEDIUM

Networks of agents break in new ways: Microsoft's red-team, plus RAMPART and Clarity

Microsoft Research red-teamed an internal platform of 100+ always-on agents. Four attack patterns — propagation, amplification, trust capture, proxy chains — show up only at the network level. RAMPART and Clarity, open-sourced May 20, 2026, are the response.

2026-05-27//8 min
AGENTS CRITICAL

Antigravity find_by_name: when a native tool call jumps over Secure Mode

On April 20, 2026, Pillar Security disclosed that a single unsanitised parameter in Google Antigravity's find_by_name tool turned file search into arbitrary code execution — and bypassed the IDE's strictest sandbox.

2026-05-27//7 min
OFFENSIVE AI MEDIUM

Apple's May 2026 bulletin formally credits Claude on two macOS CVEs

On May 11, 2026, Apple's macOS Tahoe 26.5 advisory named Claude alongside its researchers on two CVEs — a kernel integer overflow and a WebKit use-after-free. AI-assisted vulnerability research is now in the official changelog.

2026-05-27//6 min
INFRASTRUCTURE CRITICAL

BadHost (CVE-2026-48710): one Host-header character bypasses auth in Starlette, vLLM and FastMCP

X41 D-Sec disclosed on May 22, 2026 a critical auth bypass in Starlette < 1.0.1. A single / ? or # in the HTTP Host header desynchronises the routed path from the path the middleware sees, breaking path-based authorization in vLLM, LiteLLM, FastMCP and thousands of FastAPI-based AI agents.

2026-05-27//7 min
DATA LEAK CRITICAL

Bleeding Llama: a GGUF parsing flaw leaks Ollama process memory to unauthenticated attackers

CVE-2026-7482, publicly disclosed in May 2026 and codenamed Bleeding Llama by Cyera, lets a remote attacker pull arbitrary chunks of an Ollama server's heap — API keys, system prompts, other users' conversations — with three unauthenticated API calls. The silent patch shipped 2.5 months before the CVE was assigned.

2026-05-27//7 min
AGENTS CRITICAL

ClaudeBleed: when a browser agent trusts the wrong extension

LayerX disclosed ClaudeBleed on May 6, 2026: a trust-boundary flaw let any Chrome extension drive Claude in Chrome and exfiltrate Gmail, Drive and GitHub data. The first patch was bypassed within hours.

2026-05-27//7 min
PROMPT INJECTION CRITICAL

Encoded prompt injection: when guardrails fail because the LLM decodes the payload

On May 4, 2026 a tweet written in Morse code drained around $175K from a Grok-controlled crypto wallet. The incident is the most expensive demonstration to date of an old defensive blind spot — string-matching guardrails can't see through encodings that the model itself happily decodes.

2026-05-27//7 min
OFFENSIVE AI MEDIUM

The first CVE wave: AI-assisted discovery is reshaping disclosure volumes

VulnCheck's May 14, 2026 analysis shows year-to-date CVE issuance up +563% on Chrome, +476% on GitHub, +180% on VMware, +170% on Apache. The systemic shift behind the Apple, Mozilla and ActiveMQ headlines is now visible in the numbers.

2026-05-27//7 min
PROMPT INJECTION MEDIUM

Font-mapping prompt injection: when peer review becomes an LLM attack surface

A May 25, 2026 arXiv benchmark shows hidden font-mapping payloads can flip LLM peer reviews from reject to accept. ICML 2026 already used the same trick in reverse to desk-reject 497 papers.

2026-05-27//7 min
AGENTS CRITICAL

MCP STDIO transport: the design choice that became 11 CVEs and 200,000 exposed agents

On April 16, 2026, OX Security disclosed that Anthropic's MCP STDIO transport executes any OS command it is handed. Anthropic called it 'by design'. The cascade has produced eleven downstream CVEs in six weeks.

2026-05-27//7 min
RESEARCH MEDIUM

MultiBreak: 10,389 multi-turn prompts expose how conversational jailbreaks slip past LLM safety

A May 3, 2026 ICML paper releases the largest, most diverse multi-turn jailbreak benchmark to date. It records attack-success-rate gaps of up to 54 points over the previous state of the art on DeepSeek-R1-7B and 34.6 on GPT-4.1-mini — and quantifies how alignment that holds in single turns collapses across follow-ups.

2026-05-27//7 min
AGENTS CRITICAL

When prompts become shells: prompt injection escalates to RCE in agent frameworks

Two CVEs in Microsoft Semantic Kernel and four in CrewAI — all disclosed in early 2026 — turn a single injected prompt into remote code execution on the host. The pattern is structural, not incidental.

2026-05-27//7 min
RESEARCH LOW

Teaching Claude Why: how Anthropic drove agentic misalignment to zero

On May 8, 2026, Anthropic's Alignment Science team published a case study showing that teaching Claude to explain its ethical reasoning — not just demonstrate it — cut agentic misalignment from 96% to under 1%.

2026-05-27//7 min
AGENTS MEDIUM

Poison once, exploit forever: persistent memory poisoning of LLM agents (OWASP ASI06)

An April 2026 arXiv paper on cross-site memory poisoning and a May 13, 2026 OWASP post on the Cisco MemoryTrap finding against Claude Code converge on the same lesson: agent memory is a trust boundary.

2026-05-26//7 min
AGENTS MEDIUM

Treating AI agents like operating systems: a CISPA blueprint for isolation and privilege

A May 14, 2026 CISPA paper applies decades of OS security thinking to LLM agents. Tested on four OpenClaw-like systems, two weakness classes — cross-user exfiltration and unauthorized network egress — fail in every single one.

2026-05-26//7 min
OFFENSIVE AI CRITICAL

AI-assisted ICS attack: lessons from the Monterrey water utility intrusion

Dragos' May 2026 report on Servicios de Agua y Drenaje de Monterrey documents the first publicly analysed campaign in which a commercial LLM — Claude — was the primary technical operator of an attempted OT intrusion.

2026-05-26//7 min
MULTIMODAL CRITICAL

AudioHijack: imperceptible audio hijacks voice agents (IEEE S&P 2026)

An April 16, 2026 IEEE S&P paper introduces auditory prompt injection: adversarial reverb hidden in audio drives 13 large audio-language models and commercial voice agents (Mistral AI, Microsoft Azure) into unauthorized actions with 79-96% success.

2026-05-26//7 min
INDIRECT INJECTION MEDIUM

Discourse AI XSS (CVE-2026-27740): when LLM output is trusted as HTML

A flagged post, an AI moderator, an htmlSafe call. The Discourse AI plugin treated LLM output as trusted markup, turning indirect prompt injection into Staff-side XSS. Published March 19, 2026.

2026-05-26//6 min
AGENTS CRITICAL

The Lethal Trifecta: when an agent reads private data, untrusted content, and can phone home

Simon Willison's framework for the single architectural mistake that turned 2026's wave of AI-agent data exfiltration vulnerabilities into a class, not a coincidence.

2026-05-26//7 min
AGENTS MEDIUM

MCP Back-End Vulnerabilities: classic flaws resurface across AI database bridges

Akamai's May 12, 2026 research found SQL injection (CVE-2025-66335), missing authentication, and unsanitised inputs across three MCP servers — Apache Doris, Apache Pinot, and Alibaba RDS. The pattern, not the bugs, is the story.

2026-05-26//7 min
OFFENSIVE AI MEDIUM

OpenAI Daybreak and GPT-5.5-Cyber: a permissive security model behind a verified-identity gate

Between May 7 and 12, 2026, OpenAI launched Daybreak — a cybersecurity platform built on GPT-5.5, Codex Security and a 'cyber-permissive' sibling, GPT-5.5-Cyber. UK AISI's prior evaluation found a universal jailbreak in six hours.

2026-05-26//7 min
DEFENSE MEDIUM

Project Glasswing: 10,000+ critical bugs found by Claude Mythos in a month

Anthropic's May 26, 2026 update on Project Glasswing reports that ~50 partners have used Claude Mythos Preview to find more than 10,000 high/critical-severity vulnerabilities, including 271 latent bugs patched in Firefox 150 — and lays out a controlled-access model for a frontier offensive capability.

2026-05-26//7 min
AGENTS CRITICAL

Semantic Kernel: when a prompt becomes a shell (CVE-2026-25592, CVE-2026-26030)

Microsoft disclosed two critical vulnerabilities in Semantic Kernel on May 7, 2026 that turn a single injected prompt into host-level code execution. The root cause is architectural: tool registries and eval() treated as features, not security boundaries.

2026-05-26//7 min
SUPPLY CHAIN MEDIUM

Hidden triggers in SKILL.md: semantic supply-chain attacks on agent skill registries

A May 12, 2026 University of Maryland paper shows that 20-token additions to a SKILL.md file can make an agent discover and select an adversarial skill in 77–86% of trials, and bypass registry-side scans up to 100% of the time.

2026-05-26//7 min
AGENTS MEDIUM

Trust No Tool: cognitive poisoning of LLM agents through tool feedback

A May 17, 2026 arXiv paper introduces 'cognitive poisoning' — a malicious tool that wins the agent's trust over many benign-looking turns and only weaponises the final action. The defence target shifts from prompts to trajectory.

2026-05-26//7 min
ADVERSARIAL MEDIUM

Usability as a Weapon: how feature requests turn coding LLMs insecure

A May 11, 2026 arXiv paper shows that asking a coding LLM for a faster, simpler or feature-richer version of secure code reliably drops the security constraints. UPAttack reaches 98.1% on GPT-5.2-chat and Gemini-3.

2026-05-26//7 min
DEFENSE MEDIUM

Agents Rule of Two: Meta's pragmatic answer to unsolved prompt injection

Published Oct 31, 2025 by Meta and re-adopted in Databricks' May 2026 guide, the Agents Rule of Two limits any agent session to two of three risky properties — the most actionable framework while prompt injection remains unsolved.

2026-05-25//6 min
AGENTS CRITICAL

Azure SRE Agent: a multi-tenant token check that let strangers watch your incidents (CVE-2026-32173)

Disclosed April 20, 2026, an Entra ID app-registration misconfiguration on Azure SRE Agent's /agentHub WebSocket let any tenant connect, listen to every prompt, reasoning step, CLI command and credential — silently.

2026-05-25//7 min
AGENTS CRITICAL

CVE-2026-35435: Azure AI Foundry's M365 published agents trusted callers they shouldn't have

Disclosed May 7, 2026 (CVSS 8.6), an improper access-control flaw in Azure AI Foundry let unauthorized attackers elevate privilege through M365 published agents. Microsoft reports active exploitation; mitigations are available before a patch.

2026-05-25//6 min
AGENTS CRITICAL

Claw Chain: four OpenClaw CVEs that turn an AI agent into the attacker's hands

Disclosed May 15, 2026, Cyera Research's Claw Chain chains four patched OpenClaw flaws — sandbox escape, env-var disclosure, MCP loopback EoP, symlink read escape — into full host takeover via the agent itself.

2026-05-25//7 min
AGENTS CRITICAL

Comment and Control: one prompt injection pattern, three vendors leaking GitHub Actions secrets

Disclosed April 15, 2026, Comment and Control turns ordinary PR titles, issue bodies and HTML comments into credential-exfiltration channels in Claude Code, Gemini CLI and GitHub Copilot Agent.

2026-05-25//7 min
RESEARCH MEDIUM

Contextual integrity: why prompt-injection defenses keep failing

A May 2026 paper by Abdelnabi and Bagdasarian recasts prompt injection through Contextual Integrity and shows that data-instruction separation is a category mistake.

2026-05-25//6 min
PROMPT INJECTION CRITICAL

Copirate 365: chaining prompt injection, delayed tool invocation and memory hijack in M365 Copilot (CVE-2026-24299)

Johann Rehberger's DEF CON writeup, published May 2026, walks through a five-stage indirect prompt-injection chain that turns one booby-trapped email into a persistent backdoor inside Microsoft 365 Copilot. Patched, but the patterns are generic.

2026-05-25//7 min
INDIRECT INJECTION MEDIUM

Indirect prompt injection in the wild: three April 2026 studies converge

Google, Forcepoint and CISPA independently measured indirect prompt injection across the open web in April 2026. The picture: 15K+ validated payloads, 32% growth, organized templates.

2026-05-25//7 min
INFRASTRUCTURE CRITICAL

LiteLLM CVE-2026-42208: a pre-auth SQL injection in the AI gateway

Disclosed April 20, 2026 and exploited 36 hours after the global advisory dropped, CVE-2026-42208 turns LiteLLM's Authorization header into a direct read on every provider key the proxy fronts.

2026-05-25//6 min
JAILBREAK MEDIUM

Mathematical encoding jailbreaks: when set theory bypasses LLM safety

An arXiv paper posted on May 5, 2026 shows that re-expressing a harmful prompt as a set-theory or formal-logic problem bypasses safety training on 46–56% of attempts across eight frontier models — but only when a helper LLM does the reformulation, not when mathematical syntax is bolted on top.

2026-05-25//7 min
RESEARCH MEDIUM

When the attacker is another LLM: large reasoning models as autonomous jailbreakers

A Nature Communications paper formalised in May 2026 shows four reasoning models — DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini and Qwen3 235B — jailbreaking nine target LLMs with a 97.14% overall success rate, armed with nothing but a single system prompt.

2026-05-25//6 min
AGENTS CRITICAL

PraisonAI CVE-2026-44338: an unauthenticated agent server, exploited in 3h44

Disclosed May 11, 2026, CVE-2026-44338 ships PraisonAI with authentication hard-disabled in its legacy API server. A CVE-Detector scanner hit the endpoint less than four hours later.

2026-05-25//6 min
INDIRECT INJECTION MEDIUM

ShareLeak (CVE-2026-21520): the first CVE Microsoft assigned to a Copilot prompt injection

Disclosed April 15, 2026, Capsule Security's ShareLeak write-up details an indirect prompt injection in Microsoft Copilot Studio. Microsoft assigned CVE-2026-21520 (CVSS 7.5) — an unusual industry first that reframes prompt injection as a tracked vulnerability class.

2026-05-25//7 min
DEFENSE MEDIUM

ARGUS: a provenance-graph defense for context-aware prompt injection

Published May 5, 2026, the ARGUS paper introduces influence-provenance auditing for LLM agents — dropping attack success from 28.8% to 3.8% on a new context-aware injection benchmark.

2026-05-22//7 min
DEFENSE MEDIUM

The Instruction Hierarchy: training LLMs to rank privileged instructions

OpenAI's 2024 paper proposes a structural defense against prompt injection: teach models that system > user > tool output. The idea is now central to GPT-4o-mini and o-series safety training.

2026-05-22//7 min
INFRASTRUCTURE CRITICAL

LMDeploy SSRF: when an image loader turns into an AI-infrastructure hijack

CVE-2026-33626 turned LMDeploy's load_image() into a generic SSRF primitive. Honeypots saw the first weaponised exploit 12 hours and 31 minutes after the advisory went live.

2026-05-22//6 min
AGENTS CRITICAL

Localhost agent hijack: cross-origin WebSocket attacks on AI coding agents

CVE-2026-44211 (CVSS 9.7), disclosed May 7, 2026, shows how a single visit to a malicious page can hijack an AI coding agent running on a developer's laptop. The attack class is generic — and architectural.

2026-05-22//7 min
SUPPLY CHAIN CRITICAL

Mini Shai-Hulud: the supply-chain worm that came for the AI tooling stack

Disclosed May 11–18, 2026, the Mini Shai-Hulud worm trojanised 170+ npm and PyPI packages — including Mistral AI, Guardrails AI and TanStack — and persists inside Claude Code and VS Code.

2026-05-22//7 min
DEFENSE MEDIUM

Output filtering beats model self-defense: 20,000 adaptive attacks, one survivor

Posted April 26 and revised May 12, 2026, a Swept AI / Michigan paper pitted nine prompt-injection defenses against an adaptive attacker. Every model-side defense eventually broke. Application-side output filtering held — zero leaks across 15,000 attacks.

2026-05-22//6 min
AGENTS CRITICAL

Prompts as shells: when prompt injection becomes RCE in agent frameworks

Two CVEs disclosed in Microsoft Semantic Kernel on May 7, 2026 (CVE-2026-25592, CVE-2026-26030) show how a single injected prompt can pivot from text to remote code execution on the agent's host.

2026-05-22//7 min
PROMPT INJECTION CRITICAL

ASCII Smuggling: Hidden commands via Unicode Tag characters

Unicode Tag characters (U+E0000–U+E007F) are invisible to humans but interpreted by LLMs. Attackers embed them in emails, web pages, and PDFs to inject silent commands that hijack agent behavior.

2026-05-19//8 min
JAILBREAK CRITICAL

Many-shot jailbreaking: 256 examples to bypass any alignment

Anthropic researchers showed that stuffing the context window with 256 fake Q&A examples reliably bypasses safety training. Bigger context = bigger attack surface.

2026-05-15//6 min
DATA LEAK CRITICAL

System prompt extraction via repetition attacks

Asking the model to 'repeat the word poem forever' causes it to eventually dump training data and system prompts. Documented across Claude 3, GPT-4, and Gemini.

2026-05-10//4 min
RESEARCH LOW

Sleeper agents: hidden backdoors that survive safety training

Anthropic demonstrated that models trained with hidden trigger phrases retain backdoor behavior even after standard RLHF safety training. The implications for open-weight LLMs are significant.

2026-05-03//14 min