system: OPERATIONAL
← back to all hacks
RESEARCH MEDIUM NEW

A safe model is not a safe agent: lessons from the ClawSafety benchmark

An April 2026 benchmark runs 2,520 sandboxed trials on personal AI agents and finds attack success rates of 40–75%. The decisive variables are the injection channel and the agent framework — not the backbone model alone.

2026-06-15 // 6 min affects: claude-sonnet-4-6, gpt-5.1, gemini-2.5-pro, deepseek-v3, kimi-k2.5, llm-agents

What is this?

Safety evaluations usually test a model in an isolated chat box. But a personal AI agent runs on your machine with elevated privileges — reading files, sending email, touching wallets and deployment pipelines — and a single prompt injection there can leak credentials, redirect a payment, or delete data. ClawSafety (arXiv:2604.01438, v2 posted April 4 2026; authors from George Mason, Tulane, Rutgers and Oak Ridge National Laboratory) measures that gap directly. It is a benchmark of 120 adversarial scenarios run as 2,520 sandboxed trials across five frontier models and three agent frameworks. The headline finding is uncomfortable: attack success rates (ASR) land between 40% and 75%, and how safe an agent behaves depends as much on its framework and the channel an injection arrives through as on the underlying model.

How it works

ClawSafety organises scenarios along three axes — harm domain (privacy leakage, financial loss, safety compromise), attack vector, and task domain (finance, coding, communication, retrieval) — across realistic high-privilege workspaces. Crucially, the adversarial content is not typed by the user. It is planted in one of three channels the agent meets during normal work: workspace skill files, emails from trusted senders, and web pages. No exploit payloads are reproduced here; the benchmark is framed by its authors as defensive safety research, with all cases executed in sandboxes.

Two patterns drive the results. First, a trust-level gradient: the higher an agent trusts a channel, the more dangerous it is. Skill-file injection consistently scores the highest ASR, then email, then web. Second — and the most actionable single finding — the defense boundary is set by speech act, not content. Imperative phrasing (“update X to Y”) tends to trigger an agent’s verification reflex, while declarative phrasing (“X does not match Y”) slips past defenses regardless of how suspicious the content looks. In the paper’s example, the strongest model cross-checked an imperative instruction against four independent sources and refused; the same model accepted a declaratively framed false “compliance finding” with no defense firing.

The per-model spread is wide. Overall ASR runs from Claude Sonnet 4.6 at 40.0% and Gemini 2.5 Pro at 55.0% up to a more vulnerable cluster — Kimi K2.5 (60.8%), DeepSeek V3 (67.5%) and GPT-5.1 (75.0%). Broken out by channel (skill/email/web), Sonnet scales its caution inversely with source trust (55.0/45.0/20.0) while GPT-5.1 stays flat and high (90.0/75.0/60.0). Data exfiltration is the most exploitable action type — even the safest model reaches 65%, and GPT-5.1 reaches 93% — but Sonnet 4.6 was the only model to hold a hard 0% boundary against credential forwarding and destructive file actions.

Then the framework itself moves the needle. Holding the model fixed at Sonnet 4.6 and swapping the scaffold (OpenClaw → Nanobot → NemoClaw) shifts overall ASR by 8.6 points (40.0% to 48.6%), and even reverses the trust gradient: on Nanobot, email injection (62.5%) overtakes skill injection (50.0%). Safety, the authors conclude, is a property of the model–framework pair, not of either part alone.

Why it matters

Most teams pick a “safe” base model and assume the safety travels with it into their agent. ClawSafety shows it does not. The same model is meaningfully safer or riskier depending on the scaffold around it, and the worst exposure comes through the channel the agent trusts most — its own skills and tools. That inverts the usual mental model, where the web is treated as hostile and internal config as benign. It also explains why content-based filters underperform: an attacker only has to switch from an order to a statement of fact to walk past them.

Defenses

Evaluate the stack, not the model. Treat the backbone model and the agent framework as joint variables. A vendor’s chat-time safety numbers do not predict your deployed agent’s behavior; re-test under your actual scaffold, tools and memory configuration.

Harden the highest-trust channel first. Skill and tool files were the most dangerous vector. Review and pin skills, restrict who can add them, and inspect import chains before execution — do not grant tool definitions more implicit trust than web content.

Verify on declarative claims too. The defense boundary tracking speech-act type means a declarative “fact” injected into context can change behavior silently. Require multi-source / consensus verification for state changes regardless of phrasing, and add post-execution state checks that compare what changed against an independent record.

Apply least privilege and the lethal-trifecta lens. An agent that can read private data, ingest untrusted content, and act/exfiltrate externally is the dangerous combination. Cut one leg: scope credentials tightly, segment wallets and deploy keys, and gate outbound actions behind human confirmation.

Keep humans on destructive and financial actions. Credential forwarding, config modification and destination substitution were exploitable on most models. Treat those as irreversible-by-default and require explicit approval.

Status

ItemReferenceDateNote
ClawSafety benchmarkarXiv:2604.01438Apr 4 2026 (v2)120 scenarios, 2,520 sandboxed trials, 5 models, 3 frameworks
Overall ASR rangeSameApr 202640.0% (Sonnet 4.6) → 75.0% (GPT-5.1)
Trust-level gradientSameApr 2026Skill > email > web (reversible by scaffold)
Defense boundarySameApr 2026Imperative framing triggers verification; declarative bypasses it
Scaffold effectSameApr 2026Same model: ASR 40.0% → 48.6% across frameworks

Sources