DEFENSE

SherAgent: LLM-driven attack investigation and the trust it inherits

A July 2026 paper puts an LLM agent in the SOC loop to reconstruct attacks from provenance graphs. It is a real capability gain — and a reminder that any agent reasoning over attacker-touched logs inherits an injection surface.

2026-07-17//6 min

Agentic secret scanning: when an LLM maps a leaked credential to what it unlocks

A July 2026 research paper describes an LLM agent that not only finds credentials leaked in documents but reasons about the blast radius each one opens. A defensive tool with an obvious dual-use edge.

GPT-Red: training an attacker model to harden defenders against injection

On July 15, 2026, OpenAI described GPT-Red, an internal red-teaming model trained by self-play to find prompt injections. It beat humans 84% to 13% — and was then used to make GPT-5.6 more robust.

Catching agent memory poisoning from tool-call logs alone

A June 2026 study shows memory-channel poisoning leaves a forensic fingerprint in an agent's tool-call trajectory — a recall-before-send pattern detectable without touching memory, model weights, or message content.

Proving which agent produced a log, when the reseller owns the log

TRACE, published July 9, 2026, watermarks an agent's trajectory itself — surviving a reseller who can delete and rewrite the very log that provenance is judged from.

SingGuard-NSFA: an open-source guardrail built for agent execution, not just content

Ant Group open-sourced a guardrail family that screens an agent's requests and actions before they run — 185 threat scenarios, 133 languages, and ~50 ms classification latency.

Why fine-tuning collapses safety guardrails: the alignment-similarity effect

An ACL 2026 study finds that safety alignment breaks after fine-tuning largely because the fine-tuning data resembles the original alignment data — an upstream design problem, not just a downstream accident.

Context bombs: defensive prompt injection against attacker AI agents

A mid-July 2026 Tracebit study hides short guardrail-tripping strings inside decoy secrets, cutting five offensive AI agents' full-admin success from roughly 57% to 5% in an AWS cyber range.

Cyber deception works better on AI attackers than on humans

A June 2026 study ran a 21-model attacker cohort against classic deception traps and found every model took the bait more often than humans — and kept taking it even after naming the trap.

A lambda calculus that proves agents resist prompt injection

A formal calculus for AI agents models conversations, tool calls and code execution as first-class terms — and proves a noninterference theorem showing information-flow control can contain prompt injection.

Cross-Site Prompting: the XSS-shaped threat facing web agents

A UC Berkeley paper names the web-agent analogue of XSS — Cross-Site Prompting — and proposes a system-level confinement layer that cuts attack success from 85.5% to 0.7% without touching the site.

RAGCharacter: character-level traceback of poisoned spans in RAG evidence

A May 2026 preprint proposes black-box, character-level forensics that pinpoints the exact poisoned span inside a retrieved chunk after a RAG system misbehaves, instead of quarantining whole passages.

Defending content from agentic crawlers at the compression layer

A July 2026 paper argues context compression — not access control — is the unguarded layer where AI agents strip web content, and that invisible perturbations can survive it to protect data.

Four gates against multi-turn jailbreaks that no single message reveals

A July 2026 paper interposes an independent oversight model with four gates — intent, zero-trust context, cross-turn consistency, and output risk — to catch jailbreaks that look benign message by message.

DEFENSE CRITICAL NEW

GhostLock kernel container escape breaks the agent sandbox assumption

A 15-year-old Linux futex use-after-free disclosed on 8 July 2026 gives an unprivileged local user root and escapes containers — the exact isolation layer most agentic code-execution sandboxes lean on.

2026-07-14//7 min

Your guardrail announces itself: fingerprinting defenses from the outside

A July 2026 paper shows that a separate guardrail leaks its presence, its blocked categories, and whether it — not the model — refused, using only HTTP, wording, and timing signals from black-box access.

Stopping sensitive data from leaking into third-party LLM chats

A July 2026 paper builds an open-source, client-side firewall that intercepts prompts before they reach ChatGPT, Claude or Copilot and blocks PII, secrets and proprietary code from leaving.

Gating a pentest agent's calls before they run: what a scope judge needs to see

A July 2026 benchmark shows a cheap LLM judge can catch out-of-scope tool calls from offensive-security agents — but only if it sees the user's request, not a static policy alone.

Auditing agent token flows before they reach privileged sinks

A July 2026 paper reframes persistent-agent security around natural-language token flows, inspecting memory writes, tool arguments and retrieved content at the boundary before they mutate state.

Catching rogue agents by reading their activations, not their messages

A July 2026 preprint argues that watching what multi-agent systems say misses stealthy attacks. Reading each agent's internal activation states detects compromise even when the messages look benign — and repairs the agent instead of isolating it.

Attribution graphs: diagnosing why a jailbreak works inside the model

A July 2026 paper compares a model's internal computation graphs on paired safe and jailbreak prompts to find the causal circuits behind a bypass, then intervenes on them to harden the model.

Command denylists are the wrong defense for terminal AI agents

A June 20, 2026 Ohio State study ran 1,709 real-world agent command denylists through an automated bypass finder and found 69–98.6% fail to block the operations they claim to stop.

Prompt instructions aren't an enforcement layer for enterprise agents

A July 2026 study shows prompt instructions can't reliably enforce an enterprise agent's output and trace contracts — only code-owned enforcement around the model kept both safety and full utility.

Agents can't verify authority: the case for off-host tool authorization

A July 2026 paper shows model-side refusal is unreliable, with refusal rates ranging from 38% to 100% across 15 models, and argues authorization for tool calls belongs outside the agent, bound to verified identity.

Turning the MCP description field into a shield for taint-style server flaws

A July 2026 paper finds taint-style bugs dominate MCP server vulnerabilities and get patched slowly — then proposes hardening the tool description itself so the model refuses the dangerous call.

Attention as the battleground for RAG poisoning: steer it, or read it

A single poisoned passage can hijack a RAG answer by capturing the model's attention. New work turns that same attention into a detection signal — and a way to wall documents off from each other.

2026-07-09//7 min

AutoSpec: teaching agent safety rules to fix their own false positives

Hand-written agent guardrails are either too strict or too loose. A late-June 2026 paper uses inductive logic programming to evolve those rules from labelled examples, cutting false positives by up to 94% while staying auditable.

BraveGuard: teaching a guard model to watch an agent's whole trajectory

A June 2026 paper argues static safety filters miss computer-use agent harm, and trains a guard model on open-world threats and real execution traces — raising trajectory detection from 39% to 82%.

Windows Execution Containers: OS-level isolation for autonomous agents

Microsoft's June 2026 MXC SDK moves agent containment into Windows itself — process and session isolation, per-agent identity and runtime policy for code-executing agents.

Provably robust RAG: aggregating retrieved passages to survive poisoning

A May 2026 paper proposes PRA-RAG, a retrieval-aggregation defense with theoretical robustness bounds that cuts corpus-poisoning success rates to as low as 1% while keeping 71% accuracy.

Reading an agent's tool-use intent before it acts: pre-action probes

A June 2026 paper reads two signals — is a tool needed, and how risky is it — straight from an agent's activations before execution, turning post-hoc logs into a pre-action oversight layer.

AgentFlow: static analysis that finds prompt-to-tool risks in agent code

A July 2026 paper builds a dependency graph for LLM agent programs across five frameworks, generates an Agent Bill of Materials, and flags 238 taint-style prompt-to-tool risks in real code.

2026-07-07//6 min

AgentLens: catching unsafe coding-agent steps inside the model's activations

A late-June 2026 paper proposes a white-box defense that reads a coding agent's own hidden states to flag harmful execution steps mid-task, then steers them out through a tiny activation subspace.

2026-07-07//7 min

Contextual state continuity: verifying an agent's memory before it acts

A July 2026 paper proposes a defense that recomputes and checks a cryptographic digest of an agent's tool state and memory before every query, catching tool and memory poisoning that biases behaviour silently.

2026-07-07//6 min

Untrusted Content Masking: a provable injection defense for web agents

A July 2026 paper restores the trust boundary web agents lose when they read a rendered page — masking untrusted DOM regions and routing them through a type-constrained model to block injection by construction.

2026-07-07//7 min

Why a 0.998 AUC probe may not actually detect prompt injection

A June 2026 study shows a hidden-state probe can score AUC 0.998 at flagging indirect prompt injection in computer-use agents while learning surface artefacts — and proposes controls to tell real detection apart.

kNNGuard: a training-free guardrail read from LLM activations

A July 2026 paper builds a prompt guardrail from just 50 labeled examples by reading a model's own hidden activations — no fine-tuning, and 2.7x faster than the best comparable classifier.

MAGE: a shadow memory that catches long-horizon agent attacks

A May 2026 paper borrows the shadow-stack idea from systems security to give LLM agents a parallel security memory, cutting multi-turn attack success from 100% to 8.3%.

OWASP AISVS 1.0: a testable checklist for verifying AI application security

OWASP shipped the first stable release of its AI Security Verification Standard in late June 2026 — 14 chapters of pass/fail requirements that turn AI governance intent into evidence, including dedicated agent and MCP chapters.

SUDP: letting agents act on your credentials without ever holding them

A May 2026 protocol reframes agent credential handling: instead of putting a reusable secret inside the model-steerable runtime, the agent only proposes an operation the user signs off on, single-use.

AI-Infra-Guard: why agent red teaming needs one method per layer

A framework released on 30 June 2026 argues the agent attack surface is stratified — infrastructure, tools, behavior, model — and no single detection method fits all four.

2026-07-05//6 min

Stopping infectious jailbreaks in multi-agent systems with local purification

In a network of multimodal agents, one poisoned image can spread a jailbreak agent-to-agent until most of the system is compromised. A May 2026 paper proposes a training-free, per-agent cure.

2026-07-05//7 min

Stopping a compromise before it spreads across a multi-agent system

Most multi-agent defenses detect a bad agent and isolate it after the fact — by then the damage is done. A June 2026 paper simulates each message's impact before it propagates, and rewrites the risky ones.

2026-07-05//6 min

Agent Zero Trust: what Anthropic's framework fixes, and what it can't

Anthropic's May 2026 Zero Trust framework reshapes enterprise agent security around per-task identity and memory integrity — but Gartner warns it still can't fully secure high-autonomy agents.

AgentWatch: an open framework for auditing how safely browser agents behave

A UC Berkeley capstone audited five leading AI browsing agents across five risk dimensions and released an open, stochastic-aware scoring framework anyone can extend.

One filter is not enough: a layered defense for RAG chatbots

A mid-June 2026 paper argues single-stage prompt-injection filters leave gaps a poisoned knowledge-base document walks through, and tests a three-layer pipeline that drops attack success from 71% to 11%.

Locate-and-Judge: attention-based detection of malicious agent skills

A June 2026 paper scans about 134,000 agent skills across three marketplaces and confirms 131 live malicious ones, using instruction-following attention to surface payloads hidden inside benign-looking skill files.

MDASH: multi-model agentic vulnerability discovery reaches production defense

Microsoft's MDASH harness orchestrates 100+ specialized AI agents to find, debate and prove kernel bugs. It surfaced 16 Windows CVEs and scored 88.45% on CyberGym — the defensive signal, and the dual-use one.

2026-07-04//7 min

Safety token regularization: keeping fine-tuned LLMs aligned

An April 2026 paper shows benign fine-tuning quietly erodes an LLM's refusals, and proposes a lightweight logit-space regularizer that preserves safety without hurting task accuracy.

Where the instruction hierarchy breaks in reasoning models

A June 2026 diagnostic paper decomposes instruction-hierarchy failures in reasoning LLMs into three stages — and shows training-free self-monitoring can repair most of them.

MemAudit: forensic auditing to find poisoned entries in agent memory

Most agent-memory defenses try to block poisoning up front. A May 2026 paper flips the problem: audit the memory store after the fact, tracing a bad action back to the entries that caused it.

Argument-level provenance stops injection where whole-call defenses fail

A May 2026 paper argues indirect injection only turns dangerous when untrusted data binds an authority-bearing argument. PACT checks provenance per argument, recovering utility at full security.

2026-07-03//7 min

Task-alignment reasoning beats pattern-matching against adaptive prompt injection

A June 2026 paper shows static benchmarks overstate injection defenses: adaptive attackers lift the worst-case success rate by ~16 points. RETA anchors decisions on the user's task instead of the attacker's text.

2026-07-03//7 min

SCOUT: adaptive detector allocation for prompt-injection defense

Posted to arXiv in May 2026, SCOUT reframes prompt-injection defense as a per-request routing problem, reportedly cutting attack success by 46% and latency by 40% versus an always-on LLM judge.

TRACE: catching RAG corpus poisoning by following token influence

A July 2026 paper detects poisoned documents in a RAG corpus by tracing which retrieved tokens drove the model's answer — no extra classifier or second LLM, and it surfaces the attacker's target answer as a side effect.

Sharing prompt-injection intel across LLM services without sharing prompts

A SaTML 2026 paper from Microsoft turns detected injection prompts into privacy-preserving binary fingerprints, so one service can warn another about an attack without exposing raw user text.

When injections speak the document's language: the camouflage detection gap

Two 2026 studies show prompt injections written in a document's own domain jargon slip past guard classifiers — Llama Guard 3 caught zero. Paraphrasing retrieved content is the defense that holds up best, but results swing by model.

Harness vs. model: benchmarking LLMs on access-control bug detection

A June 2026 Semgrep benchmark on IDOR detection found an open-weight model beating a frontier coding agent on a bare prompt — but a purpose-built harness still led. What defenders should take away.

Memory laundering defeats content- and lineage-based agent memory defenses

A June 2026 paper proves any defense that bases a memory item's authority on its content or its derivation history can be laundered — and that only write-time origin binding stops agent memory poisoning.

Out-of-band injection defenses haven't met an adaptive attacker yet

A June 2026 paper warns that reference-monitor defenses like CaMeL and Progent are still judged on static benchmarks — the exact method that made in-band defenses look strong until adaptive attacks broke them.

2026-07-02//7 min

A certified defense for the RAG memory a poisoned agent never forgets

A June 2026 paper models multi-session memory poisoning — where one crafted memory quietly corrupts every future user — and offers the first defense with a provable robustness bound instead of a heuristic filter.

Cognitive Firewall: a split-compute defense for browser agents

A March 2026 eBay paper layers an on-device sentinel, a cloud planner and a deterministic execution guard to cut indirect prompt injection in browser agents from 100% to under 1%.

2026-06-22//6 min

MemMark: attributing a poisoned agent memory from the snapshot alone

A May 26, 2026 arXiv paper embeds ownership into an agent's latent memory-write decisions, so provenance survives even when logs are erased and only the final memory snapshot remains.

2026-06-22//6 min

DeepMind's AI Control Roadmap: defense-in-depth for misaligned agents

Google DeepMind's AI Control Roadmap (June 2026) treats internal AI agents as potential insider threats, layering trusted-supervisor monitoring on top of model alignment.

Backdoor unlearning generalizes: removing one trigger can suppress others

A June 2026 paper shows that teaching an LLM to ignore one backdoor trigger can also weaken other, never-targeted backdoors — when their internal activation shifts are close, measured by a new metric called CASD.

Defensive misdirection: why blocking automated jailbreaks can backfire

A June 2026 paper models the attacker's automated judge and shows that predictable refusals feed the search loop — proposing controlled misdirection instead of plain blocking.

LLM salting: rotating the refusal direction to break jailbreak reuse

SophosAI's 'LLM salting' (CAMLIS 2025) applies a small rotation to a model's refusal direction so that a jailbreak precomputed against the base model no longer transfers to your deployment — the rainbow-table defense, applied to LLMs.

Why agent refusals fail: the Cybersecurity Refusal Framework

A new benchmark shows agent safety refusals key off the URL string, not the real target. Two trivial tricks — fake 'rules of engagement' and localhost proxying — flip refusal into compliance on production sites.

2026-06-20//6 min

MCP security: stop asking which attacks exist, ask where defenses must live

An April 2026 arXiv paper maps MCP attacks across six architectural layers and finds defenses are uneven and disproportionately tool-centric — leaving host orchestration, transport and supply-chain layers structurally under-defended.

2026-06-20//7 min

Localizing prompt injection: from detection to forensic excision

Detecting a prompt injection only tells you something is wrong. Two 2026 papers, PromptLocate and WebSentinel, pinpoint exactly which span of context is poisoned so it can be excised and the task recovered.

2026-06-20//6 min

SEAgent: mandatory access control to contain agent privilege escalation

A January 2026 paper reframes agent attacks as privilege escalation — actions exceeding the least privilege a task needs — and proposes SEAgent, a deterministic MAC/ABAC layer that enforces policy over an information-flow graph.

2026-06-20//6 min

AuthGraph: dual-graph alignment to catch agent prompt injection

A May 26, 2026 UCLA paper compares a clean authorization graph against the agent's actual provenance graph, cutting AgentDojo attack success from 40% to 1%.

2026-06-19//6 min

Cordon: transactional containment for tool-using LLM agents

A June 16, 2026 arXiv paper proposes 'semantic transactions': a runtime that stages an agent's irreversible tool effects and validates the whole task flow before any commit.

2026-06-19//6 min

DoubtProbe: catching jailbreaks that reorganize intent

A June 2026 paper proposes an inference-time defense that treats jailbreak detection as a consistency check: rebuild the request under structural constraints, then flag the prompts whose meaning won't survive the round-trip.

2026-06-18//5 min

SafeMCP: look-ahead tool gating against power-seeking in MCP agents

A June 1, 2026 arXiv paper (ACL 2026) proposes SafeMCP, a server-side plugin that uses world-model look-ahead to filter hazardous tool acquisition before an MCP agent over-expands its powers.

2026-06-18//6 min

SkillVetBench: an LLM-as-Judge that catches what skill scanners miss

A June 14, 2026 arXiv paper shows code-layer skill scanners miss 89–100% of instruction-layer threats, while an LLM-as-Judge flags all 78 malicious test skills with zero false positives.

2026-06-18//6 min

The lethal trifecta is now the default — defend agents at runtime

The lethal trifecta once flagged risky agents. By mid-2026 it describes every useful one, so architecture-level avoidance no longer works. Defense shifts to five runtime behavioral signals.

2026-06-18//6 min

Dummy backdoors: removing unknown LLM backdoors via shared internal mechanisms

A June 2026 paper removes hidden backdoors you can't see by planting one you can: different backdoors share internal activation patterns, so deleting a controllable 'dummy' weakens the unknown one too.

2026-06-17//6 min

Detecting attacks in agent tool-call traffic: content beats graph

A May 2026 arXiv study of MCP tool-call monitoring finds content embeddings drive detection (AUROC > 0.89), graph structure adds little, and naive random splits inflate scores by up to 26 points.

2026-06-17//6 min

RUBAS: rubric-based RL gives agent safety a fine-grained reward signal

A June 2026 paper replaces coarse refuse/comply rewards with four scored rubrics — tool-use, argument, response and helpfulness — to train tool-calling agents that stay safe without losing utility.

2026-06-17//5 min

SkillGuard: a permission framework that governs what an agent skill can do at runtime

A June 2026 paper closes the gap between what a skill injects into an agent's context and what it makes the agent do, using manifests, deny-by-default access control and runtime monitoring.

2026-06-17//6 min

Provenance defenses for agent graph memory are blind by construction

An arXiv paper dated June 10, 2026 shows provenance checks on LLM graph memory can be bypassed without forging a single source: untrusted structure reroutes which authenticated facts get selected, and information-flow control never sees it.

Agent privacy is a trajectory problem: OCELOT budgets inference leakage at runtime

An arXiv paper dated June 10, 2026 reframes LLM-agent privacy as posterior-risk control: not filtering each output, but budgeting how much an adversary's belief about a secret may improve across a whole trajectory.

Parallax: putting agent safety in the architecture, not the prompt

A position paper published April 14, 2026 argues prompt-level guardrails fail the moment an agent's reasoning is compromised, and proposes structurally separating the part that thinks from the part that acts.

2026-06-16//7 min

Architecting secure agents: a plan-and-policy defense against prompt injection

An NVIDIA position paper (March 31, 2026) argues that indirect prompt injection cannot be fixed at the model alone — and proposes a plan-and-policy system architecture that constrains what an agent may observe and decide.

Verified agent skills: capability governance for the SKILL.md supply chain

NVIDIA's May 19, 2026 verified agent skills add risk scanning, cryptographic signing and machine-readable skill cards to the SKILL.md supply chain — a defensive answer to poisoned skills.

Confidential Computing for Agentic AI: what enclaves can't protect

A May 2026 survey maps confidential computing onto the agentic stack — hardware enclaves can shield agent memory and KV caches from a malicious cloud operator, but they cannot stop prompt injection.

Why jailbreaks transfer between models — and how salting fights back

A study of 20 open-weight models finds jailbreak transfer comes from shared internal representations, not safety-training quirks. A defense called LLM salting rotates the refusal direction to break reuse.

Prompt injection is unsolved — so contain it at machine speed

At Infosecurity Europe 2026, OWASP's Ariel Fogel called prompt injection an unresolved architectural problem and argued defenders must shift from prevention to runtime containment that runs as fast as the agent.

Why prompt-injection detectors keep failing: the evasion problem in 2026

From keyword classifiers to activation-based drift probes, prompt-injection detectors share one weakness: an adaptive attacker. Two studies report up to ~100% evasion. Treat detection as one layer, never the boundary.

SafeHarbor: a hierarchical memory guardrail that targets agent over-refusal

Accepted at ICML 2026, SafeHarbor is a training-free guardrail that injects context-aware safety rules from a self-evolving risk tree — keeping 63.6% benign utility on GPT-4o while refusing over 93% of attacks.

SecureClaw: a dual-boundary defense for tool-using LLM agents

A June 2026 paper proposes guarding two distinct boundaries at once — authorizing external actions at the effect sink and confining plaintext at the read boundary — reporting 0% attack success on one agent benchmark.

2026-06-14//6 min

PI-Hunter: auditing agents to expose and localize hidden prompt injections

A June 2026 paper from Google researchers reframes prompt-injection red-teaming as auditing — PI-Hunter evolves source-aware test cases to surface where latent injections enter and propagate through an agent, not just whether an attack lands.

2026-06-13//6 min

AgentDyn: why injection defenses that ace static benchmarks fail in the wild

A February 2026 ICML benchmark, AgentDyn, runs ten leading prompt-injection defenses on dynamic, open-ended agent tasks. Almost all are either insecure or over-defend into uselessness.

The Defense Trilemma: why prompt-injection wrappers can't be complete

A Lean 4-verified April 2026 proof shows no continuous, utility-preserving input wrapper can block every prompt injection. Continuity, utility, and completeness cannot all hold at once.

2026-06-12//7 min

Inside GitHub Agentic Workflows: a security architecture for CI/CD agents

GitHub Agentic Workflows reached public preview on June 11, 2026 with a security-first design: zero-secret agents in a chroot jail, a workflow firewall, staged-and-vetted writes, and a threat-detection job. The defensive answer to prompt injection in CI/CD.

2026-06-12//7 min

The Recuse Signal: a robots.txt for agents that hold real credentials

A June 2026 paper proposes an in-band 'deny' signal — emitted over an SSH banner or a PostgreSQL NOTICE — that politely asks an autonomous agent to withdraw. In a pilot it induced 100% recusal, but an authorization framing flipped the strongest model right back.

Tool stream injection: why static agent defenses break, and what verify-before-commit fixes

A January 2026 paper, VIGIL, reframes indirect injection around the tool stream — forged tool descriptions and fake error messages — and shows that the better-aligned an agent is, the more it obeys them.

TRUSTDESC: deriving tool descriptions from code to defuse tool poisoning

An April 2026 paper attacks tool poisoning at its root: generate a tool's description from its implementation instead of trusting the author-supplied text, neutralising implicit poisoning that detectors miss.

CASA: task-based access control that checks tool calls against the user's real intent

A May 4, 2026 arXiv paper proposes Continuous Agent Semantic Authorization — a zero-trust layer that extracts a user's task from a multi-turn chat and denies tool calls that don't match it.

2026-06-11//6 min

Oversight has a capacity: when more agent approvals make you less safe

A June 8, 2026 arXiv paper models the human reviewer behind an agent's approval gate as a fatiguing, finite resource — and shows that escalating more actions can lower realized safety and open a flooding attack.

2026-06-11//7 min

ADR: detection and response for MCP agents, proven at Uber scale

A May 2026 paper from Uber describes a production EDR-style system for MCP agents: full causal telemetry, two-tier detection, and offline red-teaming, running on 7,200+ hosts for ten months.

Agent Security Is a Systems Problem: Treat the Model as Untrusted

A May 2026 position paper from Google, UCSD and UW–Madison argues agent security must move out of the model and into the system: treat the LLM as an untrusted component and enforce invariants around it.

2026-06-08//8 min

AgentTrust: vetting agent tool calls before they execute

A preprint from May 6, 2026 introduces AgentTrust, a runtime layer that vets each agent tool call before it runs and returns allow/warn/block/review — catching obfuscated shell payloads static guards miss.

Catching model extraction by watching the whole traffic window, not single queries

A June 2026 paper shows a simple distribution test (MMD over query embeddings, calibrated on benign traffic only) detects LLM model-extraction campaigns hidden in mixed API traffic, reporting 0.3% false positives and 100% detection on pure-attacker streams.

ePCA: replacing semantic agent guardrails with formal verification

A May 2026 paper proposes ePCA, a guardrail that compiles each agent action into first-order logic and runs an SMT check before execution, blocking unsafe steps as logical deadlocks.

Microsoft's agentic failure-mode taxonomy v2.0: zero-click human-in-the-loop bypass

Microsoft's AI Red Team v2.0 taxonomy (June 4, 2026) adds seven agentic failure modes and reports human-in-the-loop bypass as the most consistently exploited — including zero-click chains from a single external input.

2026-06-07//7 min

AgentVisor: an OS-hypervisor pattern that audits every agent tool call

An April 27, 2026 arXiv paper borrows the OS hypervisor idea to defend tool-using LLM agents: a trusted 'visor' audits every tool call and is architecturally blind to untrusted content.

2026-06-07//7 min

Need to Know: contextual-integrity query rewriting for LLM delegation

A June 2, 2026 arXiv paper recasts privacy-preserving query rewriting as a contextual-integrity problem: forward a span to a cloud LLM only if the task needs it, not because a PII type matched.

Two methodology traps that inflate prompt-injection detector scores

A June 1, 2026 arXiv preprint shows most prompt-injection and jailbreak detector benchmarks lean on per-dataset threshold tuning and undisclosed operating points — two habits that quietly inflate the accuracy you buy.

Membrane: contrastive safety memory that adapts guardrails without retraining

A June 4, 2026 arXiv paper proposes Membrane, a self-evolving guardrail that pairs each blocked attack with a near-identical benign request, cutting over-refusal to 7–14% while topping F1 on six jailbreaks.

OpenAI Lockdown Mode: cutting the exfiltration leg of prompt injection

On June 6, 2026 OpenAI extended Lockdown Mode to personal and self-serve Business ChatGPT accounts: a deterministic setting that disables outbound paths attackers use to exfiltrate data via prompt injection.

THRD: a training-free temporal defense against multi-turn jailbreaks

A June 2026 paper argues multi-turn jailbreaks must be judged across the whole conversation, not turn by turn. THRD scores accumulated risk over time and cuts attack success to 0.2–4% without retraining.

The agent that writes its own logs: why self-reported agent audit trails can't be trusted

If a compromised agent produces its own activity log, it can omit, alter, or fabricate what it did. Three June 2026 efforts — arXiv's Notarized Agents, an IETF agent-audit-trail draft, and SCITT — converge on the same fix: move the trust boundary off the agent.

2026-06-05//6 min

When embedding-based defenses fail in LLM multi-agent systems

A May 1, 2026 arXiv paper shows that detectors that prune malicious agents by message embedding collapse when attackers craft near-benign text, and proposes token-confidence signals as a more robust replacement.

2026-06-05//6 min

Catching credential exfiltration in LLM agents before the output token

Published June 2, 2026, an arXiv paper detects agent credential leaks before any output token is emitted — combining activation probes, calibrated honeytokens, and multi-turn leakage accounting.

2026-06-04//7 min

AgentShield: catching compromised agents with honeytokens and decoy tools

A May 2026 paper turns deception engineering on tool-using LLM agents: fake tools, fake credentials, and parameter allowlists that a hijacked agent trips over. It reports 90.7–100% detection of successful attacks with zero false alarms.

Hybrid BM25 + vector retrieval cut gradient-guided RAG poisoning from 38% to 0%

A March 10, 2026 arXiv preprint shows that adding sparse BM25 alongside dense retrieval blocks an entire class of gradient-optimized RAG corpus poisoning — without touching the LLM.

OWASP Agent Memory Guard: a runtime layer against agent memory poisoning

Covered by Help Net Security on June 1, 2026, OWASP's Agent Memory Guard is the first reference implementation for ASI06 — a drop-in layer that screens every agent memory read and write against a YAML policy.

PISmith: adaptive RL red-teaming keeps breaking injection defenses

A March 2026 paper trains an attacker model with reinforcement learning to stress-test prompt-injection defenses in a black-box setting — and 8 state-of-the-art defenses still fall, including on AgentDojo and InjecAgent.

Agent Threat Rules: a "Sigma for AI agents" — and what its recall numbers admit

ATR ships open YAML detection rules for agent attacks, now running at Microsoft, Cisco and Gen Digital. Its own benchmarks show why regex detection is a layer, not a perimeter.

2026-06-03//6 min

DataShield: when benign fine-tuning quietly erodes a model's safety

A May 29, 2026 arXiv paper shows fine-tuning an aligned LLM on harmless data still degrades its safety, and proposes DataShield to flag the samples responsible before training.

2026-06-03//6 min

SnapGuard: catching prompt injection in what the agent sees, not what it parses

An April 2026 paper proposes a lightweight detector for screenshot-based web agents, where text-centric guards are blind. It reads the rendered pixels — gradient stability plus polarity-reversed text — at 1.81s per page.

2026-06-03//6 min

Dynamic separators: hardening Polymorphic Prompt Assembling against injection

A May 28, 2026 arXiv paper fixes a blast-radius flaw in Polymorphic Prompt Assembling by generating a unique SHA-256 separator per request, cutting one payload's attack success rate from 0.88 to 0.38.

2026-06-02//6 min

Stop scoring jailbreak defenses on attack success rate alone

A May 2026 IEEE S&P paper argues that attack success rate — the field's default metric — hides how jailbreak defenses actually behave. Its Security Cube evaluates them across several axes at once.

2026-06-02//6 min

Causal attribution: an emerging defense against indirect prompt injection

A cluster of early-2026 papers — CausalArmor and AttriGuard — defends tool-calling agents by asking which actions are causally driven by untrusted content rather than by the user. A look at the causal-attribution line of defense.

2026-06-01//6 min

The guardrail trade-off triangle: prompt-injection defenses for LLM tutors

A May 2026 benchmark of prompt-injection defenses for educational LLM tutors puts numbers on a hard truth: no single guardrail wins robustness, usability and latency at the same time.

2026-06-01//6 min

Jailbreaks leave a trace: detecting attacks in LLM internal activations

A February 2026 paper and a March 2026 follow-up show jailbreak prompts carve a distinguishable signature into a model's hidden activations — enabling inference-time detection without fine-tuning or an auxiliary judge model.

2026-06-01//6 min

MCP needs a trust handshake: attested tool-server admission

A May 22, 2026 arXiv paper proposes mcp-attested — a backward-compatible MCP extension that gates tool dispatch on signed clearance, deny-by-default allowlists, and tamper-evident audit logs.

2026-05-29//6 min

One million exposed AI services: what the Intruder scan actually found

On May 5, 2026, Intruder published the results of an internet-wide scan that mapped 1 million exposed AI services across 2 million hosts. The recurring failure is not exotic — it is permissive defaults.

2026-05-29//7 min

WARD: a co-evolved guard model that holds up against adaptive prompt injection on web agents

A May 14, 2026 NUS paper proposes WARD — a guard model trained against a memory-driven adversarial attacker — and reports near-perfect out-of-distribution recall on web-agent prompt injection.

2026-05-29//7 min

Project Glasswing: 10,000+ critical bugs found by Claude Mythos in a month

Anthropic's May 26, 2026 update on Project Glasswing reports that ~50 partners have used Claude Mythos Preview to find more than 10,000 high/critical-severity vulnerabilities, including 271 latent bugs patched in Firefox 150 — and lays out a controlled-access model for a frontier offensive capability.

2026-05-26//7 min

Agents Rule of Two: Meta's pragmatic answer to unsolved prompt injection

Published by Meta on October 31, 2025 and re-adopted in Databricks' May 2026 guide, the Agents Rule of Two limits any agent session to two of three risky properties. It is the most actionable framework available while prompt injection remains unsolved.

2026-05-25//6 min

ARGUS: a provenance-graph defense for context-aware prompt injection

Published May 5, 2026, the ARGUS paper introduces influence-provenance auditing for LLM agents — dropping attack success from 28.8% to 3.8% on a new context-aware injection benchmark.

2026-05-22//7 min

The Instruction Hierarchy: training LLMs to rank privileged instructions

OpenAI's 2024 paper proposes a structural defense against prompt injection: teach models that system > user > tool output. The idea is now central to GPT-4o-mini and o-series safety training.

2026-05-22//7 min