DEFENSE MEDIUM NEW

Oversight has a capacity: when more agent approvals make you less safe

A June 8, 2026 arXiv paper models the human reviewer behind an agent's approval gate as a fatiguing, finite resource — and shows that escalating more actions can lower realized safety and open a flooding attack.

2026-06-11 // 7 min // affects: llm-agents, ai-coding-assistants, human-in-the-loop-guards

What is this?

On June 8, 2026, Emre Turan posted “Oversight Has a Capacity: Calibrating Agent Guards to a Subjective, Fatiguing Human” on arXiv (cs.AI, CC BY 4.0), with open code and data at github.com/turangenesis/headroom. It is not a new exploit. It is a measurement of a design assumption almost every agent deployment relies on without testing: the human-in-the-loop approval gate.

As LLM agents move from suggesting code to running it — deploy, rm -rf, push to main, read a secret — frameworks ship a “pause the risky action and ask a human” primitive (LangGraph’s interrupt(), and equivalents). The paper’s argument is that the gate is the easy part. The hard part is the escalation policy deciding which actions to pause, and the field judges that policy against two assumptions that are both false: that “risky” has a ground-truth label, and that the reviewer is a perfect, always-available oracle.

How it works

The author hand-labels 125 adversarially-weighted coding-agent actions and runs three measurements.

First, there is no single ground truth. Three persona reviewers labelling the same actions reach only moderate agreement (Fleiss’ κ = 0.52); the “pragmatic” persona calls 87 actions safe while the “cautious” one calls only 45 safe, with disagreement concentrated on the contested middle. A guard cannot be scored against one objective truth — the agreement ceiling is the honest yardstick.
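For readers who want to run the same kind of agreement check on their own reviewers, here is a minimal sketch of Fleiss' κ in plain Python. The function name `fleiss_kappa` and the toy labels are illustrative assumptions; they are not the paper's code or data.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of categorical labels.

    Every item must be labelled by the same number of raters.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})

    # counts[i][c] = number of raters who gave category c to item i
    counts = [Counter(item) for item in ratings]

    # Observed agreement per item, then averaged over items
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement from the marginal share of each category
    shares = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_chance = sum(s ** 2 for s in shares)

    return (p_observed - p_chance) / (1 - p_chance)


# Toy example: three hypothetical reviewers label four actions
toy_labels = [
    ["safe", "safe", "safe"],
    ["safe", "safe", "risky"],
    ["risky", "risky", "risky"],
    ["safe", "risky", "risky"],
]
toy_kappa = fleiss_kappa(toy_labels)
```

A κ well below 1 on your own action log is a sign that scoring the guard against a single reviewer's labels will overstate or understate its quality.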

Second, framing the guard as selective classification under asymmetric cost makes its limits measurable. When a missed dangerous action costs 50 times as much as a false alarm (a 50:1 cost ratio), the cost-minimising policy collapses to “escalate almost everything”: reaching zero missed danger requires a near-100% false-alarm rate. On hard, ambiguous inputs the guard cannot safely auto-decide, so it must lean on the human.
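The collapse toward “escalate almost everything” follows from simple expected-cost arithmetic. The sketch below assumes the guard outputs a calibrated probability that an action is dangerous and that escalation costs one false-alarm unit whenever the action was benign; both are simplifying assumptions, and the function names are hypothetical rather than taken from the paper.

```python
def escalation_threshold(miss_cost=50.0, false_alarm_cost=1.0):
    """Probability of danger above which escalating is cheaper than auto-allowing."""
    return false_alarm_cost / (miss_cost + false_alarm_cost)


def should_escalate(p_danger, miss_cost=50.0, false_alarm_cost=1.0):
    """Compare the expected cost of auto-allowing with that of escalating."""
    cost_if_allowed = p_danger * miss_cost                  # paid when the action is dangerous
    cost_if_escalated = (1.0 - p_danger) * false_alarm_cost  # paid when the action is benign
    return cost_if_allowed > cost_if_escalated


threshold = escalation_threshold()  # 1 / 51, just under 2%
```

With a 50:1 ratio, any action the guard considers more than about 2% likely to be dangerous is cheaper to escalate, which is why nearly everything ambiguous ends up in front of the human.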

Third — the headline result — when the reviewer is modelled as endogenous (reliable up to a capacity, then fatiguing as escalation load grows), realised safety becomes an inverted-U in the escalation rate. Escalate too little and the guard auto-allows danger; escalate too much and the overloaded reviewer rubber-stamps. The safety-optimal escalation rate sits below full escalation. In the simulation, at reviewer capacity 25 the optimum escalates 64% of actions and lets 42% of danger through, versus 57% when escalating everything. The author is explicit that this is a modelling result about a plausible fatigue curve, not a human study.
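To see why an inverted-U appears at all, consider a toy model. Everything in it is an assumption made for illustration: the square-root recall curve, the capacity-over-load fatigue curve, and the 100-action workload. It does not reproduce the paper's fatigue curve or its 64% / 42% / 57% figures; it only shows the shape.

```python
def danger_let_through(rate, n_actions=100, capacity=25):
    """Fraction of dangerous actions that get through at a given escalation rate.

    Hypothetical guard: escalating the top `rate` fraction of actions by risk
    score captures sqrt(rate) of the dangerous ones.
    Hypothetical reviewer: fully reliable up to `capacity` reviews, after which
    the chance of catching a dangerous action falls as capacity / load.
    """
    recall = rate ** 0.5
    load = rate * n_actions
    catch_prob = 1.0 if load <= capacity else capacity / load
    return 1.0 - recall * catch_prob


# Sweep escalation rates from 1% to 100% and pick the safest one
rates = [i / 100 for i in range(1, 101)]
best_rate = min(rates, key=danger_let_through)
leak_at_best = danger_let_through(best_rate)
leak_at_full = danger_let_through(1.0)
```

In this toy setup the safest rate lands exactly where reviewer load meets capacity, and escalating everything lets more danger through than the optimum does. A real deployment would need a fatigue curve fitted to its own reviewers.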

Why it matters

The same fatigue dynamic is an attack surface. An adversary buries one malicious action in a stream of benign “filler” actions; each escalated filler spends reviewer attention, so by the time the malicious call surfaces, a fatigued human clicks Approve. Under a naive paranoid policy (escalating 88% of routine actions), simulated attack success climbs toward an ~80% rubber-stamp ceiling — already 40% at just 50 filler actions. A load-aware policy that escalates only 26% keeps the reviewer fresh and holds attack success at 0% until the filler volume gets large (~2× the headroom), after which it too bends upward. The defence “buys headroom, not immortality.”
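The flooding dynamic can be sketched in a few lines. The rubber-stamp curve, the headroom of 25 reviews, and the function name below are assumptions chosen for illustration; only the two escalation rates (88% and 26%) and the roughly 80% ceiling come from the article, and the sketch does not reproduce the paper's simulated numbers.

```python
def attack_success(n_filler, escalation_rate, headroom=25.0, ceiling=0.8):
    """Chance a fatigued reviewer approves the one malicious action.

    Hypothetical curve: zero while the escalated filler load fits within the
    reviewer's headroom, then rising toward `ceiling` as the load grows.
    """
    load = n_filler * escalation_rate  # filler actions that reach the human
    if load <= headroom:
        return 0.0
    return ceiling * (1.0 - headroom / load)


# Compare a paranoid policy with a load-aware one at growing filler volumes
for n_filler in (50, 100, 200, 400):
    paranoid = attack_success(n_filler, escalation_rate=0.88)
    load_aware = attack_success(n_filler, escalation_rate=0.26)
    print(f"{n_filler:>4} filler: paranoid={paranoid:.2f} load_aware={load_aware:.2f}")
```

The load-aware policy stays at zero for longer because less filler reaches the reviewer, but a large enough flood still pushes it upward, which matches the “headroom, not immortality” framing.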

The practical reframe: agent oversight is not only a classification problem (“which action is risky?”) but a resource-allocation problem. Human attention is finite, and a policy that escalates routine actions spends the same budget the genuinely dangerous action will need. This matters for anyone running coding agents, CI-integrated agents, or any system where a human approves agent tool calls at scale.

Defenses

Treat escalation rate as a tunable, not a maximum. “Escalate everything” is not the safe default; it can be strictly worse than a middle rate once reviewer fatigue is real. Measure your guard’s risk–coverage curve rather than guessing.
Adopt load-aware escalation. Suppress escalations on routine, low-risk actions so the reviewer’s attention is reserved for genuinely ambiguous ones. This both moves the operating point toward the inverted-U optimum and denies an attacker the filler load a flooding attack needs.
Budget reviewer attention explicitly. Cap escalations per reviewer per shift, rotate reviewers, and watch escalation queue depth as a security signal — the same alert-fatigue discipline mature SOCs already use.
Stop assuming a ground-truth “risky” label. Calibrate against inter-reviewer agreement, accept a contested middle, and pick the operating point with an explicit cost matrix that you state and report, rather than one you tune after the fact.
Pick the scoring model deliberately. Guard quality depends on the model behind it (the paper measures a modest Sonnet-over-Haiku gain), the threshold, and the attack mix — so re-measure when any of those change.
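Several of the defenses above (a tunable escalation band, a per-shift budget, and queue depth as a signal) can be combined in one small gate. The sketch below is a hypothetical design, not the paper's implementation: the class name, the thresholds, the budget of 25, and the choice to fail closed once the budget is spent are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class LoadAwareGate:
    low: float = 0.02   # below this risk score, auto-allow (hypothetical value)
    high: float = 0.90  # at or above this risk score, auto-block (hypothetical value)
    budget: int = 25    # escalations allowed per reviewer per shift (hypothetical value)
    used: int = 0       # escalations already spent this shift

    def decide(self, risk: float) -> str:
        """Return 'allow', 'block', or 'escalate' for one action."""
        if risk < self.low:
            return "allow"
        if risk >= self.high:
            return "block"
        if self.used >= self.budget:
            # Budget spent: fail closed instead of loading a tired reviewer.
            return "block"
        self.used += 1
        return "escalate"

    @property
    def queue_pressure(self) -> float:
        """Share of the shift budget consumed; a spike can be alerted on."""
        return self.used / self.budget


gate = LoadAwareGate(budget=2)
decisions = [gate.decide(r) for r in (0.01, 0.50, 0.40, 0.30, 0.95)]
```

Failing closed protects the reviewer at the cost of blocked work, so a sudden run of budget-exhausted blocks is itself worth investigating as a possible flooding attempt.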

Status

The work is a modelling and measurement study on a single 125-action coding-agent dataset, with personas standing in for human annotators; the inverted-U and flooding results are simulated from a documented (not yet empirically fitted) fatigue curve. The author cites prior art — fatigue-aware learning-to-defer, cost-sensitive deferral under workload constraints, and SOC alert-fatigue attacks — and frames the contribution as operationalising and measuring them together for agent action-gating. Code and data are public; a human study fitting the fatigue curve is named as future work.