The Autonomy Tax: how defense training breaks LLM agents
A March 19, 2026 USC paper measures the cost of prompt-injection-defense training on agent competence — defended models time out on 99% of tasks, vs 13% for undefended baselines.
What is this?
On March 19, 2026, Shawn Li and Yue Zhao at the University of Southern California posted The Autonomy Tax: Defense Training Breaks LLM Agents (arXiv:2603.19423). The paper does not propose a new attack or a new defense. It measures the side effects of the defenses already shipping. Across 97 agent tasks and 1,000 adversarial prompts, defense-trained models — the ones fine-tuned to refuse prompt-injection attempts — collapse as agents. Where an undefended baseline times out on 13% of tasks, the defended version times out on 99%. The authors call this gap the autonomy tax, and trace it back to shortcut learning during defense fine-tuning.
The paper extends an earlier January 2026 result from the same group — Defenses Against Prompt Attacks Learn Surface Heuristics (arXiv:2601.07185) — which already showed that defense-tuned chat models acquire surface heuristics rather than a real understanding of malicious intent. The Autonomy Tax replays that experiment in the agentic setting, where the cost of those heuristics compounds with every step.
How it works
A typical prompt-injection defense fine-tune trains the model on pairs of (benign request, comply) and (malicious injection, refuse). In a single-turn chat that is mostly fine. In an agent loop the same model now sees long, dynamic observations: tool outputs, retrieved documents, scratchpads. The paper documents three systematic biases that emerge in that setting.
- Agent incompetence bias. Defended models refuse or emit malformed tool calls on perfectly benign tasks, before they have read a single external observation. The refusal is triggered by surface features of the task description, not by anything an attacker placed in context.
- Cascade amplification bias. Agent harnesses retry failed tool calls. A defended model that refuses once tends to refuse again on the retry, and the harness eventually times out. This is how the timeout rate jumps from 13% to 99%: a small per-step refusal rate becomes a near-certain failure across multi-step trajectories.
- Trigger bias. Defended models perform worse than undefended baselines under several attack categories. Surface triggers that the model learned during defense training (specific tokens, suffix patterns, role labels) can be inverted by adaptive attackers, while attacks that do not match those triggers slip through unchanged.
The root cause analysis ties all three biases to shortcut learning: the defended models overfit to token-level and positional patterns in the training distribution rather than to semantic threat understanding. The January preprint (Li et al., 2601.07185) characterised the same phenomenon in chat models with three other measurable biases — position bias, where benign content placed after instructions is rejected at up to 90% rates; token trigger bias, where a single trigger token raises false-refusal rates by up to 50%; and topic generalization bias, where accuracy drops 40% on benign tasks outside the defense distribution. The Autonomy Tax shows how those chat-level pathologies cascade into agent collapse.
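The cascade amplification effect described above can be illustrated with a back-of-the-envelope calculation. The sketch below is not the paper's model. It assumes, purely for illustration, that each step independently ends in an unrecoverable refusal with probability `p_step`, and that a task fails if any step does. The names `trajectory_failure_rate`, `p_step` and `n_steps` are hypothetical.

```python
def trajectory_failure_rate(p_step: float, n_steps: int) -> float:
    """Probability that at least one of n_steps steps fails.

    Assumption (illustrative only): steps fail independently with the
    same probability p_step, so P(fail) = 1 - (1 - p_step) ** n_steps.
    """
    if not 0.0 <= p_step <= 1.0:
        raise ValueError("p_step must be a probability in [0, 1]")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    return 1.0 - (1.0 - p_step) ** n_steps


# Illustrative inputs, not figures from the paper.
for p_step in (0.01, 0.05, 0.20):
    for n_steps in (1, 10, 30):
        rate = trajectory_failure_rate(p_step, n_steps)
        print(f"p_step={p_step:.2f} n_steps={n_steps:2d} -> failure rate {rate:.3f}")
```

Under these assumptions, a 20% per-step failure rate over 30 steps already gives a trajectory failure rate above 99%. Real refusals are unlikely to be independent (the paper notes that a model that refuses once tends to refuse again), so this is best read as intuition for why small per-step rates compound, not as a fit to the reported numbers.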
Why it matters
The result has three implications for anyone shipping agents in mid-2026.
First, defense-tuned weights are not a drop-in replacement for an undefended base model in agent harnesses. Practitioners who swapped a base model for a defended variant to “improve safety” may have silently broken their agent’s ability to finish tasks, without any flag in their offline evals — single-turn benchmarks miss the cascade.
Second, the security gain is smaller than advertised. Both papers report that adaptive attacks bypass defended models at 95–100% success rates. Trading 86 points of task completion (the gap between a 13% and a 99% timeout rate) for a handful of percentage points against the easiest attack categories is a bad deal, and the trade is currently invisible in most vendor benchmarks.
Third, the field’s preferred evaluation methodology — static single-turn jailbreak benchmarks — does not predict agent-time behaviour. The Autonomy Tax is the latest in a string of 2026 results (alongside Cisco’s multi-turn frontier eval and UCLA’s agent-human interaction audit) showing that single-turn metrics overstate real-world safety. Agent evals need to be multi-step, with retries, with timeouts as a measured signal.
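A multi-step eval that treats timeouts as a measured signal can be sketched in a few lines. This is a minimal illustration, not the paper's harness: the step budget stands in for a timeout, every refused step is simply re-issued, and `run_episode`, `EpisodeResult` and the status strings are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical step statuses returned by the agent under test.
DONE, OK, REFUSED = "done", "ok", "refused"


@dataclass(frozen=True)
class EpisodeResult:
    outcome: str   # "completed" or "timeout"
    steps: int     # attempts consumed, including re-issued ones
    refusals: int  # attempts on which the agent refused


def run_episode(step: Callable[[int], str], max_steps: int = 20) -> EpisodeResult:
    """Drive one task until the agent reports DONE or the step budget runs out.

    `step` receives the attempt index and returns DONE, OK or REFUSED.
    A refusal is retried on the next attempt, so repeated refusals
    burn the budget and end in a recorded timeout.
    """
    refusals = 0
    for attempt in range(max_steps):
        status = step(attempt)
        if status == DONE:
            return EpisodeResult("completed", attempt + 1, refusals)
        if status == REFUSED:
            refusals += 1
    return EpisodeResult("timeout", max_steps, refusals)


# Two toy agents: one always refuses, one finishes on its third attempt.
always_refuses = lambda attempt: REFUSED
finishes_third = lambda attempt: DONE if attempt == 2 else OK

print(run_episode(always_refuses, max_steps=5))
print(run_episode(finishes_third, max_steps=5))
```

The point of the design is that a timeout is returned as a first-class outcome alongside the refusal count, so an agent that refuses on every retry shows up in the results instead of disappearing into a generic failure bucket.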
Defenses
The paper is descriptive, but the implications for defensive teams are clear.
- Do not deploy defense-tuned weights into an agent harness without re-evaluating end-to-end task success. Measure timeout rate, retry rate and refusal rate on a multi-step benchmark — not just attack success on single-turn jailbreaks.
- Prefer architectural defenses over weight-level ones for agents. Scoping, runtime approval on irreversible actions, output filtering and provenance tracking (ARGUS-style influence graphs) compose with a competent base model, instead of degrading it.
- If you must fine-tune for refusal, audit for trigger bias. Hold out benign tasks that share surface features with the defense training set (same tokens, same role labels) and confirm the model still completes them.
- Log timeout-to-completion ratios in production. A timeout rate that climbs after a weight rollout is the failure mode the paper documents; it is also indistinguishable from an outage if you only monitor uptime.
- Treat defense fine-tuning as a capability change, not a safety patch. It has to ship through the same eval pipeline as any other model swap.
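The measurement advice above can be made concrete with a small aggregation over logged task records. The sketch below assumes a hypothetical log schema (`TaskRecord`) and an arbitrary regression threshold (`max_increase`, set here to 5 points); neither comes from the paper, and both would need tuning to a real deployment.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TaskRecord:
    """One logged agent task (hypothetical schema for illustration)."""
    completed: bool
    timed_out: bool
    retries: int
    refusals: int


def summarize(records: Iterable[TaskRecord]) -> dict[str, float]:
    """Aggregate end-to-end agent metrics over a batch of task records."""
    rows = list(records)
    if not rows:
        raise ValueError("no records to summarize")
    n = len(rows)
    return {
        "completion_rate": sum(r.completed for r in rows) / n,
        "timeout_rate": sum(r.timed_out for r in rows) / n,
        "retries_per_task": sum(r.retries for r in rows) / n,
        "refusals_per_task": sum(r.refusals for r in rows) / n,
    }


def timeout_regressed(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_increase: float = 0.05,  # assumed threshold, not from the paper
) -> bool:
    """Flag a rollout whose timeout rate rose by more than max_increase."""
    return candidate["timeout_rate"] - baseline["timeout_rate"] > max_increase
```

A possible use is to run the same multi-step benchmark against the current and candidate weights, call `summarize` on each batch, and block the rollout when `timeout_regressed` returns `True`. The same comparison can run on production logs before and after a weight change.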
Status
| Item | Date | Status |
|---|---|---|
| The Autonomy Tax (arXiv:2603.19423) | March 19, 2026 | Public preprint |
| Defenses Against Prompt Attacks Learn Surface Heuristics (arXiv:2601.07185) | January 2026 | Public preprint |
| Tasks audited | — | 97 agent tasks, 1,000 adversarial prompts |
| Defended-model timeout rate | — | 99% (vs 13% undefended) |
| Adaptive attack success on defended models | — | 95–100% across both papers |
| Industry uptake | Ongoing | Discussed in multi-step agent eval work (Cisco, UCLA, 2026) |
Both papers are preprints and have not been peer-reviewed at the time of writing. The empirical core — the timeout-rate gap and the three agent biases — is the part most directly useful to defenders today.