Agent Threat Rules: a "Sigma for AI agents" — and what its recall numbers admit
ATR ships open YAML detection rules for agent attacks, now running at Microsoft, Cisco and Gen Digital. Its own benchmarks show why regex detection is a layer, not a perimeter.
What is this?
Agent Threat Rules (ATR) is an open, versioned, machine-readable detection-rule format for AI-agent attacks — prompt injection, tool poisoning, skill compromise and context exfiltration. Help Net Security covered its move into production on June 3, 2026; the project frames itself as “Sigma for AI agents,” the way Sigma standardised SIEM detection and YARA standardised malware signatures. Rules are YAML documents that declare an attack pattern, the input field to inspect (LLM input, tool-call arguments, or SKILL.md content), and the test cases that prove the rule fires. A TypeScript reference engine and a Python wrapper, pyATR, both ship under the MIT license.
The reason to cover it is not the marketing line but the transparency. The project publishes its own recall numbers, corpus by corpus, including the ones that look bad. That honesty is exactly what lets a defender reason about where rule-based detection helps and where it doesn’t.
How it works
A rule matches events from an agent’s runtime — user prompts, tool calls, MCP exchanges, memory operations, skill installs — using regex patterns and behavioural thresholds, then declares a response (block, alert, quarantine, escalate). Because every rule ships with true-positive and true-negative test cases, the rule set is itself testable and peer-reviewable, the property classic guardrail blocklists usually lack.
The benchmark numbers are the story. Per Help Net’s reporting of ATR’s version-pinned measurements:
Corpus (version-pinned) Recall What it means
----------------------------- -------- ----------------------------------------
garak in-the-wild jailbreaks 98.0% Known, structured payloads: caught
garak (all probe families) 38.5% Broaden the attack space: most slip past
hackaprompt 66.0% Mixed human-crafted attacks: partial
AdvBench / HarmBench 1.3 / 2.5% Academic adversarial sets: near-miss
JailbreakBench 5.0%
PromptBench / PromptInject 0.0% Paraphrased / semantic attacks: blind
The maintainer, Adam Lin, addressed the gap directly: every rule in those low-scoring evaluations passed its own true-positive and true-negative tests, yet the aggregate recall is near zero. The split is structural. A regex layer matches what it can express — fixed, structured attack strings — and is blind to what it can’t: paraphrased and semantically rephrased payloads. The project documents this as a coverage gap rather than hiding it, and recommends pairing ATR with credential brokering, sandboxed execution and human review for high-risk actions.
Why it matters
Two things are true at once, and both matter for defenders.
First, agent detection is finally getting a shared vocabulary. ATR maps to 10 of 10 OWASP Agentic Top 10 categories and reports 78 of 85 SAFE-MCP techniques covered (91.8%), with individual rules referencing real CVEs in Microsoft Semantic Kernel, Spring AI, LiteLLM and Claude Code. It is already running in production: Microsoft’s Agent Governance Toolkit auto-syncs an ATR rule pack weekly, Cisco AI Defense runs one in its skill-scanner, MISP at CIRCL merged a threat-intel cluster, and Gen Digital (parent of Norton, Avast and AVG) merged a pack. A vendor-neutral, machine-readable format that several Fortune-500 tools consume is a genuine step up from every team writing its own undocumented blocklist.
Second, the recall table is a warning against treating any pattern matcher as a perimeter. 98% on known jailbreaks and 0% on paraphrased ones is the signature of regex detection everywhere: excellent on the attacks you have already seen, blind to novelty. An attacker who can rephrase — which is most of them — routes around the rule. The right mental model is innate immunity: fast, cheap, high-coverage on known patterns, and explicitly not a substitute for the slower semantic and architectural defences that catch the unknown.
Defenses
ATR is a detection layer. Deploy it as one input to a defence-in-depth stack, not as the wall.
-
Run rule-based detection on the events that matter. Wire ATR (or any conforming engine) to inspect LLM I/O, tool-call arguments and
SKILL.md/skill-install events. It is cheap, fast and catches the high-volume, structured attacks — a real reduction in noise. -
Assume the regex layer is bypassable and architect behind it. Pair detection with credential brokering, sandboxed execution and tightly scoped tokens, so that an injection the rules miss still lands in a contained blast radius. This is the maintainer’s own recommendation.
-
Gate high-impact actions on human or policy approval. Irreversible or sensitive steps — sending data, writing to production, executing code — should not depend on a pattern match having fired. Detection informs; a person or policy engine confirms.
-
Add a semantic layer for the paraphrase gap. Where regex scores 0% (PromptBench, PromptInject), an LLM-based or embedding-based classifier is the complementary control. Use the rules for the 95% of known traffic and the semantic layer for the novel tail.
-
Track the benchmark, not the headline. When evaluating any agent-security product, ask for version-pinned recall and precision per corpus — exactly what ATR publishes. A single “blocks prompt injection” claim with no corpus breakdown is unfalsifiable.
-
Contribute false positives back. The format’s value compounds with community tuning. Rules tuned for recall over precision will misfire on your workload; feeding those back is what turns a shared standard into a good one.
Status
| Item | Reference | Date | Notes |
|---|---|---|---|
| ATR moves into production (coverage) | Help Net Security | 2026-06-03 | 400+ rules; “Sigma for AI agents” |
| First public release (v0.1.0) | GitHub | 2026-03-09 | 29 rules, RFC draft, MIT license |
| garak in-the-wild recall | ATR version-pinned | 2026-06 | 98.0% on known structured jailbreaks |
| garak (all families) / PromptBench | ATR version-pinned | 2026-06 | 38.5% / 0.0% — paraphrase gap |
| OWASP Agentic Top 10 coverage | ATR | 2026-06 | 10/10 categories; SAFE-MCP 78/85 (91.8%) |
| Production adopters | Help Net, project site | 2026-04 → 2026-06 | Microsoft AGT, Cisco AI Defense, MISP/CIRCL, Gen Digital |
The takeaway is not “ATR doesn’t work” — on the attacks it is built to catch, it catches them, and a shared open rule format is overdue. The takeaway is that its own honest benchmarks draw the boundary for you: rule-based detection is the fast, cheap inner layer of agent defence, and the paraphrase-shaped hole in it is precisely where your sandboxing, credential scoping and human-in-the-loop have to do the work.