LLM-Hacking

SymJack: one approved file copy becomes RCE in six AI coding agents

Sat, 30 May 2026 00:00:00 GMT

Adversa AI disclosed on May 26, 2026 a symlink-hijack pattern that turns a single benign-looking shell copy into a config overwrite and host RCE across Claude Code, Cursor, Gemini, Antigravity, Copilot, Grok Build and Codex CLIs.

Slopsquatting in 2026: 127 package names that all five frontier LLMs hallucinate

Fri, 29 May 2026 00:00:00 GMT

A May 16, 2026 arXiv replication of the USENIX Security '25 slopsquatting study finds hallucination rates are down across frontier models — but identifies 127 phantom packages that every tested model invents identically, a model-agnostic supply-chain attack surface.

Blindfold: action-level jailbreaks bypass semantic defenses on embodied LLMs

Fri, 29 May 2026 00:00:00 GMT

A SenSys '26 paper (May 11–14, 2026) introduces Blindfold, an automated framework that jailbreaks embodied LLMs by decomposing harmful goals into individually benign actions — up to 53% higher attack success than semantic-level baselines on a real 6DoF robotic arm.

MCPwn (CVE-2026-33032): nginx-ui MCP endpoint hands over the web server

Fri, 29 May 2026 00:00:00 GMT

An unauthenticated MCP endpoint in nginx-ui ≤ 2.3.3 lets any network attacker rewrite nginx configs and restart the service. CVSS 9.8, publicly disclosed on April 15, 2026, exploited in the wild within hours of the patch.

Measuring LLM exploit capability: ExploitBench, ExploitGym and the SCONE-bench refresh

Fri, 29 May 2026 00:00:00 GMT

On May 22, 2026 Anthropic published Mythos Preview results on three new exploitation benchmarks. The numbers — and the way the benchmarks decompose the exploit chain — change how defenders should think about frontier offensive capability.

Proprietary Problems: Cisco's 15-model paired-regime study shows single-turn safety scores miss most multi-turn risk

Fri, 29 May 2026 00:00:00 GMT

A May 27, 2026 Cisco study of 15 flagship closed models from OpenAI, Anthropic, Google, Amazon and xAI records multi-turn attack success rates of 7.89% to 88.30% — and cross-regime gaps up to 55 percentage points over single-turn baselines.

One million exposed AI services: what the Intruder scan actually found

Fri, 29 May 2026 00:00:00 GMT

On May 5, 2026, Intruder published the results of an internet-wide scan that mapped 1 million exposed AI services across 2 million hosts. The recurring failure is not exotic — it is permissive defaults.

The agent-human security gap: what production ships, what papers study

Fri, 29 May 2026 00:00:00 GMT

A May 23, 2026 UCLA paper audits 59 academic studies, 21 production agent systems and 26 security plugins — and finds that the defenses researchers favor have zero production deployment.

The Autonomy Tax: how defense training breaks LLM agents

Fri, 29 May 2026 00:00:00 GMT

A March 19, 2026 USC paper measures the cost of prompt-injection-defense training on agent competence — defended models time out on 99% of tasks, vs 13% for undefended baselines.

MCP needs a trust handshake: attested tool-server admission

Fri, 29 May 2026 00:00:00 GMT

A May 22, 2026 arXiv paper proposes mcp-attested — a backward-compatible MCP extension that gates tool dispatch on signed clearance, deny-by-default allowlists, and tamper-evident audit logs.

WARD: a co-evolved guard model that holds up against adaptive prompt injection on web agents

Fri, 29 May 2026 00:00:00 GMT

A May 14, 2026 NUS paper proposes WARD — a guard model trained against a memory-driven adversarial attacker — and reports near-perfect out-of-distribution recall on web-agent prompt injection.

MemMorph: hijacking tool selection in LLM agents through fluent memory poisoning

Fri, 29 May 2026 00:00:00 GMT

A May 24, 2026 arXiv paper from NTU Singapore shows three plausible-looking memory entries can steer an agent toward an attacker-chosen tool with 85.9% success — and survive three off-the-shelf defenses.

SilentRetrieval: fluent RAG corpus poisoning that slips past perplexity filters

Fri, 29 May 2026 00:00:00 GMT

A May 27, 2026 arXiv preprint introduces a two-stage attack that hides goal-hijacking triggers inside fluent documents, reaching 57% LLM-attack success on Natural Questions and MS MARCO with one poisoned record per query.

CISA + Five Eyes publish the first joint guidance on agentic-AI adoption

Thu, 28 May 2026 00:00:00 GMT

On May 1, 2026, CISA, NSA and the Five Eyes cyber agencies released 'Careful Adoption of Agentic AI Services' — a 5-risk taxonomy and a deployment playbook that critical-infrastructure operators are now expected to fold into their existing cybersecurity frameworks.

Microsoft Copilot Cowork: poisoned skills exfiltrate M365 files with no approval

Thu, 28 May 2026 00:00:00 GMT

PromptArmor's May 26, 2026 disclosure shows that a five-line prompt injection inside a Copilot Cowork skill file can leak SharePoint and OneDrive documents through auto-approved Teams messages — no patch closes the design.

CrossMPI: image-only prompt injection steers what VLMs read and see

Thu, 28 May 2026 00:00:00 GMT

A May 15, 2026 Xidian University arXiv paper introduces CrossMPI: imperceptible image perturbations that change how vision-language models interpret both the image and the user's text prompt, with 66% average success across five LVLMs.

IterInject: when an LLM optimiser writes its own indirect prompt injections

Thu, 28 May 2026 00:00:00 GMT

A May 23, 2026 paper closes the loop between payload, diagnoser and LLM optimiser — lifting indirect-injection ASR from near-zero to 33–90% on InjecAgent and compromising 5 of 9 Claude Code targets.

NSA AISC publishes MCP security design guidance for production AI

Thu, 28 May 2026 00:00:00 GMT

On May 20, 2026, NSA's Artificial Intelligence Security Center released a 15-page Cybersecurity Information Sheet on Model Context Protocol — eight classes of weakness, five real-world incidents, nine defensive recommendations.

Poisoning the Watchtower: when SOC copilots read attacker-controlled logs

Thu, 28 May 2026 00:00:00 GMT

A May 23, 2026 paper formalises log-substrate prompt injection — adversarial content in log fields steering LLM-based SOC assistants. Best defense leaves 11.8% average injection success.

pgAdmin 4 ships an LLM panel and a classic LFI+SSRF arrives with it (CVE-2026-7817)

Thu, 28 May 2026 00:00:00 GMT

pgAdmin 4 9.15 patches an authenticated LFI and SSRF in its new LLM API configuration endpoints. The bug class is decades old; the surface is brand new.

Temporal memory contamination: longitudinal safety drift in memory-equipped LLM agents

Thu, 28 May 2026 00:00:00 GMT

Three arXiv papers from April and May 2026 converge on a failure mode complementary to memory poisoning — memory-equipped agents drift unsafe as benign context accumulates, with compressed summaries acting as a laundering channel.

The pressure: open-source security teams under the AI-assisted vulnerability flood

Thu, 28 May 2026 00:00:00 GMT

On May 26, 2026, curl's Daniel Stenberg published 'The pressure' — more than one credible security report per day, twelve confirmed CVEs in half a release cycle, and a pattern other maintainers are now reporting in parallel.

The agent harness is your real privilege boundary — and most teams draw it in the wrong place

Thu, 28 May 2026 00:00:00 GMT

A May 26, 2026 Pillar Security write-up argues the harness — Claude Code, Cursor, Codex — holds the secrets, tools and hooks an agent never sees. Recent harness bugs and CVE-2026-22708 make the case concrete.

Sockpuppeting: a one-line prefill that jailbreaks 11 production LLMs

Thu, 28 May 2026 00:00:00 GMT

A line of code injected as the last assistant message coaxes 7 of 10 major models into harmful completions. The fix is not at the model — it is API-side message-order validation.

GrafanaGhost: indirect prompt injection chained with a URL-parse bug to exfiltrate dashboard data

Thu, 28 May 2026 00:00:00 GMT

Noma Security's April 7, 2026 disclosure shows how three modest defects — a stored injection point, a startsWith('/') URL check, and a one-word guardrail bypass — combine into a silent exfiltration path through Grafana's AI assistant.

Networks of agents break in new ways: Microsoft's red-team, plus RAMPART and Clarity

Wed, 27 May 2026 00:00:00 GMT

Microsoft Research red-teamed an internal platform of 100+ always-on agents. Four attack patterns — propagation, amplification, trust capture, proxy chains — show up only at the network level. RAMPART and Clarity, open-sourced May 20, 2026, are the response.

Antigravity find_by_name: when a native tool call jumps over Secure Mode

Wed, 27 May 2026 00:00:00 GMT

On April 20, 2026, Pillar Security disclosed that a single unsanitised parameter in Google Antigravity's find_by_name tool turned file search into arbitrary code execution — and bypassed the IDE's strictest sandbox.

Apple's May 2026 bulletin formally credits Claude on two macOS CVEs

Wed, 27 May 2026 00:00:00 GMT

On May 11, 2026, Apple's macOS Tahoe 26.5 advisory named Claude alongside its researchers on two CVEs — a kernel integer overflow and a WebKit use-after-free. AI-assisted vulnerability research is now in the official changelog.

BadHost (CVE-2026-48710): one Host-header character bypasses auth in Starlette, vLLM and FastMCP

Wed, 27 May 2026 00:00:00 GMT

X41 D-Sec disclosed on May 22, 2026 a critical auth bypass in Starlette < 1.0.1. A single / ? or # in the HTTP Host header desynchronises the routed path from the path the middleware sees, breaking path-based authorization in vLLM, LiteLLM, FastMCP and thousands of FastAPI-based AI agents.

Bleeding Llama: a GGUF parsing flaw leaks Ollama process memory to unauthenticated attackers

Wed, 27 May 2026 00:00:00 GMT

CVE-2026-7482, publicly disclosed in May 2026 and codenamed Bleeding Llama by Cyera, lets a remote attacker pull arbitrary chunks of an Ollama server's heap — API keys, system prompts, other users' conversations — with three unauthenticated API calls. The silent patch shipped 2.5 months before the CVE was assigned.

ClaudeBleed: when a browser agent trusts the wrong extension

Wed, 27 May 2026 00:00:00 GMT

LayerX disclosed ClaudeBleed on May 6, 2026: a trust-boundary flaw let any Chrome extension drive Claude in Chrome and exfiltrate Gmail, Drive and GitHub data. The first patch was bypassed within hours.

Encoded prompt injection: when guardrails fail because the LLM decodes the payload

Wed, 27 May 2026 00:00:00 GMT

On May 4, 2026 a tweet written in Morse code drained around $175K from a Grok-controlled crypto wallet. The incident is the most expensive demonstration to date of an old defensive blind spot — string-matching guardrails can't see through encodings that the model itself happily decodes.

The first CVE wave: AI-assisted discovery is reshaping disclosure volumes

Wed, 27 May 2026 00:00:00 GMT

VulnCheck's May 14, 2026 analysis shows year-to-date CVE issuance up +563% on Chrome, +476% on GitHub, +180% on VMware, +170% on Apache. The systemic shift behind the Apple, Mozilla and ActiveMQ headlines is now visible in the numbers.

Font-mapping prompt injection: when peer review becomes an LLM attack surface

Wed, 27 May 2026 00:00:00 GMT

A May 25, 2026 arXiv benchmark shows hidden font-mapping payloads can flip LLM peer reviews from reject to accept. ICML 2026 already used the same trick in reverse to desk-reject 497 papers.

MCP STDIO transport: the design choice that became 11 CVEs and 200,000 exposed agents

Wed, 27 May 2026 00:00:00 GMT

On April 16, 2026, OX Security disclosed that Anthropic's MCP STDIO transport executes any OS command it is handed. Anthropic called it 'by design'. The cascade has produced eleven downstream CVEs in six weeks.

MultiBreak: 10,389 multi-turn prompts expose how conversational jailbreaks slip past LLM safety

Wed, 27 May 2026 00:00:00 GMT

A May 3, 2026 ICML paper releases the largest, most diverse multi-turn jailbreak benchmark to date. It records attack-success-rate gaps of up to 54 points over the previous state of the art on DeepSeek-R1-7B and 34.6 on GPT-4.1-mini — and quantifies how alignment that holds in single turns collapses across follow-ups.

When prompts become shells: prompt injection escalates to RCE in agent frameworks

Wed, 27 May 2026 00:00:00 GMT

Two CVEs in Microsoft Semantic Kernel and four in CrewAI — all disclosed in early 2026 — turn a single injected prompt into remote code execution on the host. The pattern is structural, not incidental.

Teaching Claude Why: how Anthropic drove agentic misalignment to zero

Wed, 27 May 2026 00:00:00 GMT

On May 8, 2026, Anthropic's Alignment Science team published a case study showing that teaching Claude to explain its ethical reasoning — not just demonstrate it — cut agentic misalignment from 96% to under 1%.

Poison once, exploit forever: persistent memory poisoning of LLM agents (OWASP ASI06)

Tue, 26 May 2026 00:00:00 GMT

An April 2026 arXiv paper on cross-site memory poisoning and a May 13, 2026 OWASP post on the Cisco MemoryTrap finding against Claude Code converge on the same lesson: agent memory is a trust boundary.

Treating AI agents like operating systems: a CISPA blueprint for isolation and privilege

Tue, 26 May 2026 00:00:00 GMT

A May 14, 2026 CISPA paper applies decades of OS security thinking to LLM agents. Tested on four OpenClaw-like systems, two weakness classes — cross-user exfiltration and unauthorized network egress — fail in every single one.

AI-assisted ICS attack: lessons from the Monterrey water utility intrusion

Tue, 26 May 2026 00:00:00 GMT

Dragos' May 2026 report on Servicios de Agua y Drenaje de Monterrey documents the first publicly analysed campaign in which a commercial LLM — Claude — was the primary technical operator of an attempted OT intrusion.

AudioHijack: imperceptible audio hijacks voice agents (IEEE S&P 2026)

Tue, 26 May 2026 00:00:00 GMT

An April 16, 2026 IEEE S&P paper introduces auditory prompt injection: adversarial reverb hidden in audio drives 13 large audio-language models and commercial voice agents (Mistral AI, Microsoft Azure) into unauthorized actions with 79-96% success.

Discourse AI XSS (CVE-2026-27740): when LLM output is trusted as HTML

Tue, 26 May 2026 00:00:00 GMT

A flagged post, an AI moderator, an htmlSafe call. The Discourse AI plugin treated LLM output as trusted markup, turning indirect prompt injection into Staff-side XSS. Published March 19, 2026.

The Lethal Trifecta: when an agent reads private data, untrusted content, and can phone home

Tue, 26 May 2026 00:00:00 GMT

Simon Willison's framework for the single architectural mistake that turned 2026's wave of AI-agent data exfiltration vulnerabilities into a class, not a coincidence.

MCP Back-End Vulnerabilities: classic flaws resurface across AI database bridges

Tue, 26 May 2026 00:00:00 GMT

Akamai's May 12, 2026 research found SQL injection (CVE-2025-66335), missing authentication, and unsanitised inputs across three MCP servers — Apache Doris, Apache Pinot, and Alibaba RDS. The pattern, not the bugs, is the story.

OpenAI Daybreak and GPT-5.5-Cyber: a permissive security model behind a verified-identity gate

Tue, 26 May 2026 00:00:00 GMT

Between May 7 and 12, 2026, OpenAI launched Daybreak — a cybersecurity platform built on GPT-5.5, Codex Security and a 'cyber-permissive' sibling, GPT-5.5-Cyber. UK AISI's prior evaluation found a universal jailbreak in six hours.

Project Glasswing: 10,000+ critical bugs found by Claude Mythos in a month

Tue, 26 May 2026 00:00:00 GMT

Anthropic's May 26, 2026 update on Project Glasswing reports that ~50 partners have used Claude Mythos Preview to find more than 10,000 high/critical-severity vulnerabilities, including 271 latent bugs patched in Firefox 150 — and lays out a controlled-access model for a frontier offensive capability.

Semantic Kernel: when a prompt becomes a shell (CVE-2026-25592, CVE-2026-26030)

Tue, 26 May 2026 00:00:00 GMT

Microsoft disclosed two critical vulnerabilities in Semantic Kernel on May 7, 2026 that turn a single injected prompt into host-level code execution. The root cause is architectural: tool registries and eval() treated as features, not security boundaries.

Hidden triggers in SKILL.md: semantic supply-chain attacks on agent skill registries

Tue, 26 May 2026 00:00:00 GMT

A May 12, 2026 University of Maryland paper shows that 20-token additions to a SKILL.md file can make an agent discover and select an adversarial skill in 77–86% of trials, and bypass registry-side scans up to 100% of the time.

Trust No Tool: cognitive poisoning of LLM agents through tool feedback

Tue, 26 May 2026 00:00:00 GMT

A May 17, 2026 arXiv paper introduces 'cognitive poisoning' — a malicious tool that wins the agent's trust over many benign-looking turns and only weaponises the final action. The defence target shifts from prompts to trajectory.

Usability as a Weapon: how feature requests turn coding LLMs insecure

Tue, 26 May 2026 00:00:00 GMT

A May 11, 2026 arXiv paper shows that asking a coding LLM for a faster, simpler or feature-richer version of secure code reliably drops the security constraints. UPAttack reaches 98.1% on GPT-5.2-chat and Gemini-3.

Agents Rule of Two: Meta's pragmatic answer to unsolved prompt injection

Mon, 25 May 2026 00:00:00 GMT

Published Oct 31, 2025 by Meta and re-adopted in Databricks' May 2026 guide, the Agents Rule of Two limits any agent session to two of three risky properties — the most actionable framework while prompt injection remains unsolved.

Azure SRE Agent: a multi-tenant token check that let strangers watch your incidents (CVE-2026-32173)

Mon, 25 May 2026 00:00:00 GMT

Disclosed April 20, 2026, an Entra ID app-registration misconfiguration on Azure SRE Agent's /agentHub WebSocket let any tenant connect, listen to every prompt, reasoning step, CLI command and credential — silently.

CVE-2026-35435: Azure AI Foundry's M365 published agents trusted callers they shouldn't have

Mon, 25 May 2026 00:00:00 GMT

Disclosed May 7, 2026 (CVSS 8.6), an improper access-control flaw in Azure AI Foundry let unauthorized attackers elevate privilege through M365 published agents. Microsoft reports active exploitation; mitigations are available before a patch.

Claw Chain: four OpenClaw CVEs that turn an AI agent into the attacker's hands

Mon, 25 May 2026 00:00:00 GMT

Disclosed May 15, 2026, Cyera Research's Claw Chain chains four patched OpenClaw flaws — sandbox escape, env-var disclosure, MCP loopback EoP, symlink read escape — into full host takeover via the agent itself.

Comment and Control: one prompt injection pattern, three vendors leaking GitHub Actions secrets

Mon, 25 May 2026 00:00:00 GMT

Disclosed April 15, 2026, Comment and Control turns ordinary PR titles, issue bodies and HTML comments into credential-exfiltration channels in Claude Code, Gemini CLI and GitHub Copilot Agent.

Contextual integrity: why prompt-injection defenses keep failing

Mon, 25 May 2026 00:00:00 GMT

A May 2026 paper by Abdelnabi and Bagdasarian recasts prompt injection through Contextual Integrity and shows that data-instruction separation is a category mistake.

Copirate 365: chaining prompt injection, delayed tool invocation and memory hijack in M365 Copilot (CVE-2026-24299)

Mon, 25 May 2026 00:00:00 GMT

Johann Rehberger's DEF CON writeup, published May 2026, walks through a five-stage indirect prompt-injection chain that turns one booby-trapped email into a persistent backdoor inside Microsoft 365 Copilot. Patched, but the patterns are generic.

Indirect prompt injection in the wild: three April 2026 studies converge

Mon, 25 May 2026 00:00:00 GMT

Google, Forcepoint and CISPA independently measured indirect prompt injection across the open web in April 2026. The picture: 15K+ validated payloads, 32% growth, organized templates.

LiteLLM CVE-2026-42208: a pre-auth SQL injection in the AI gateway

Mon, 25 May 2026 00:00:00 GMT

Disclosed April 20, 2026 and exploited 36 hours after the global advisory dropped, CVE-2026-42208 turns LiteLLM's Authorization header into a direct read on every provider key the proxy fronts.

Mathematical encoding jailbreaks: when set theory bypasses LLM safety

Mon, 25 May 2026 00:00:00 GMT

An arXiv paper posted on May 5, 2026 shows that re-expressing a harmful prompt as a set-theory or formal-logic problem bypasses safety training on 46–56% of attempts across eight frontier models — but only when a helper LLM does the reformulation, not when mathematical syntax is bolted on top.

When the attacker is another LLM: large reasoning models as autonomous jailbreakers

Mon, 25 May 2026 00:00:00 GMT

A Nature Communications paper formalised in May 2026 shows four reasoning models — DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini and Qwen3 235B — jailbreaking nine target LLMs with a 97.14% overall success rate, armed with nothing but a single system prompt.

PraisonAI CVE-2026-44338: an unauthenticated agent server, exploited in 3h44

Mon, 25 May 2026 00:00:00 GMT

Disclosed May 11, 2026, CVE-2026-44338 ships PraisonAI with authentication hard-disabled in its legacy API server. A CVE-Detector scanner hit the endpoint less than four hours later.

ShareLeak (CVE-2026-21520): the first CVE Microsoft assigned to a Copilot prompt injection

Mon, 25 May 2026 00:00:00 GMT

Disclosed April 15, 2026, Capsule Security's ShareLeak write-up details an indirect prompt injection in Microsoft Copilot Studio. Microsoft assigned CVE-2026-21520 (CVSS 7.5) — an unusual industry first that reframes prompt injection as a tracked vulnerability class.

ARGUS: a provenance-graph defense for context-aware prompt injection

Fri, 22 May 2026 00:00:00 GMT

Published May 5, 2026, the ARGUS paper introduces influence-provenance auditing for LLM agents — dropping attack success from 28.8% to 3.8% on a new context-aware injection benchmark.

The Instruction Hierarchy: training LLMs to rank privileged instructions

Fri, 22 May 2026 00:00:00 GMT

OpenAI's 2024 paper proposes a structural defense against prompt injection: teach models that system > user > tool output. The idea is now central to GPT-4o-mini and o-series safety training.

LMDeploy SSRF: when an image loader turns into an AI-infrastructure hijack

Fri, 22 May 2026 00:00:00 GMT

CVE-2026-33626 turned LMDeploy's load_image() into a generic SSRF primitive. Honeypots saw the first weaponised exploit 12 hours and 31 minutes after the advisory went live.

Localhost agent hijack: cross-origin WebSocket attacks on AI coding agents

Fri, 22 May 2026 00:00:00 GMT

CVE-2026-44211 (CVSS 9.7), disclosed May 7, 2026, shows how a single visit to a malicious page can hijack an AI coding agent running on a developer's laptop. The attack class is generic — and architectural.

Mini Shai-Hulud: the supply-chain worm that came for the AI tooling stack

Fri, 22 May 2026 00:00:00 GMT

Disclosed May 11–18, 2026, the Mini Shai-Hulud worm trojanised 170+ npm and PyPI packages — including Mistral AI, Guardrails AI and TanStack — and persists inside Claude Code and VS Code.

Output filtering beats model self-defense: 20,000 adaptive attacks, one survivor

Fri, 22 May 2026 00:00:00 GMT

Posted April 26 and revised May 12, 2026, a Swept AI / Michigan paper pitted nine prompt-injection defenses against an adaptive attacker. Every model-side defense eventually broke. Application-side output filtering held — zero leaks across 15,000 attacks.

Prompts as shells: when prompt injection becomes RCE in agent frameworks

Fri, 22 May 2026 00:00:00 GMT

Two CVEs disclosed in Microsoft Semantic Kernel on May 7, 2026 (CVE-2026-25592, CVE-2026-26030) show how a single injected prompt can pivot from text to remote code execution on the agent's host.

ASCII Smuggling: Hidden commands via Unicode Tag characters

Tue, 19 May 2026 00:00:00 GMT

Unicode Tag characters (U+E0000–U+E007F) are invisible to humans but interpreted by LLMs. Attackers embed them in emails, web pages, and PDFs to inject silent commands that hijack agent behavior.

Many-shot jailbreaking: 256 examples to bypass any alignment

Fri, 15 May 2026 00:00:00 GMT

Anthropic researchers showed that stuffing the context window with 256 fake Q&A examples reliably bypasses safety training. Bigger context = bigger attack surface.

System prompt extraction via repetition attacks

Sun, 10 May 2026 00:00:00 GMT

Asking the model to 'repeat the word poem forever' causes it to eventually dump training data and system prompts. Documented across Claude 3, GPT-4, and Gemini.

Sleeper agents: hidden backdoors that survive safety training

Sun, 03 May 2026 00:00:00 GMT

Anthropic demonstrated that models trained with hidden trigger phrases retain backdoor behavior even after standard RLHF safety training. The implications for open-weight LLMs are significant.