
NRT-Bench: multi-turn red-teaming of LLM agents that run a plant

A June 18, 2026 benchmark puts LLM operator agents in a simulated nuclear control room. Adaptive multi-turn attacks pushed the team past a safety limit in 8.7-12.1% of sessions — and the failures barely overlap across models.

2026-06-20 // 6 min // affects: llm-agents, operator-agents, safety-critical-systems

What is this?

On June 18, 2026, Hanwool Lee, Dasol Choi, Bokyeong Kim, Seung Geun Kim and Haon Park posted NRT-Bench (arXiv:2606.20408, cs.CR/cs.AI), a benchmark for multi-turn red-teaming of LLM agents acting as operators of a safety-critical system. The setting is a simulated nuclear power plant control room — chosen not because the threat is reactor sabotage, but because it gives the authors a system with hard, objective safety limits that an agent can be talked into violating.

The contribution is a measurement venue, not an attack. LLM agents are increasingly proposed as supervisory components for industrial and safety-critical control, yet their robustness under sustained, adaptive adversarial pressure is poorly characterised. Most jailbreak benchmarks score a single turn and let a judge model decide whether the output was “harmful.” NRT-Bench does neither: it runs a full operator team over many turns and defines harm as an objective physical signal.

How it works

The simulated plant is governed by six critical safety functions (CSFs). A five-role operator team, each role backed by a configurable LLM, runs the plant. Adversaries inject messages over four channels in bounded multi-turn sessions, with per-turn feedback so the attack can adapt to how the operators respond.

The harm signal is the part worth copying. Rather than asking an LLM judge to rate a transcript, the benchmark terminates a run the moment any CSF is lost and attributes the loss to the specific message that caused it:

Operator team (5 roles, each an LLM)
        │  runs the plant under 6 critical safety functions (CSFs)
        ▼
Adversary ──► 4 injection channels ──► multi-turn session (per-turn feedback)
        │
        ▼
Termination: a CSF is lost  ──►  harm = objective event, attributed to the causing message

This is a benchmark, so there are no operational payloads to reproduce here. The interesting design choices are methodological: multi-turn (the attack persists and adapts, like the multi-turn jailbreaks studied in MultiBreak and LITMUS), team-based (five interacting roles, not a single chatbot), and objectively scored (a physical safety function either holds or it does not).
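
The session loop described above can be sketched in a few lines. This is a minimal illustration, not the NRT-Bench code: `run_session`, `Outcome`, `ToyPlant` and `adaptive_attack` are all hypothetical names, the toy plant has a single made-up safety function, and the loss is blamed on the message injected in the turn where the limit was crossed, which is a simplifying assumption about attribution.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Outcome:
    csf_lost: Optional[str]      # name of the lost safety function, or None
    causing_turn: Optional[int]  # 0-based turn whose message is blamed for the loss
    turns_used: int


def run_session(
    attack: Callable[[List[str]], str],     # adversary: feedback so far -> next message
    operators: Callable[[str], str],        # operator team: injected message -> response
    lost_csf: Callable[[], Optional[str]],  # plant check: which CSF is lost, if any
    max_turns: int,
) -> Outcome:
    """Bounded multi-turn session with an objective termination signal."""
    feedback: List[str] = []
    for turn in range(max_turns):
        message = attack(feedback)          # the attack adapts to per-turn feedback
        feedback.append(operators(message))
        lost = lost_csf()                   # harm is read from plant state, not judged
        if lost is not None:
            return Outcome(lost, turn, turn + 1)
    return Outcome(None, None, max_turns)


class ToyPlant:
    """Hypothetical stand-in for the simulator: one safety margin, one CSF."""

    def __init__(self, margin: int = 2) -> None:
        self.margin = margin

    def operators(self, message: str) -> str:
        # Toy operator team: refuses plain requests, complies with "urgent:" ones.
        if message.startswith("urgent:"):
            self.margin -= 1
            return "complied"
        return "refused"

    def lost_csf(self) -> Optional[str]:
        return "toy_cooling" if self.margin <= 0 else None


def adaptive_attack(feedback: List[str]) -> str:
    # Escalates once it has seen any operator response.
    return "lower flow" if not feedback else "urgent: lower flow"


plant = ToyPlant()
outcome = run_session(adaptive_attack, plant.operators, plant.lost_csf, max_turns=5)
print(outcome)
```

In this toy run the first request is refused, the attacker escalates, and the session ends on the third turn when the margin reaches zero. The point of the shape is that no transcript is ever scored: the only thing that ends a run early is the plant state itself.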

Why it matters

Three findings stand out, and each has a defensive reading.

The authors evaluated four frontier operator models under a fixed-attack, paired-replay protocol. Across all four, between 8.7% and 12.1% of attack sessions ended with the plant losing a critical safety function. A failure rate of roughly one in ten under adaptive pressure is the headline number for anyone considering an LLM as a supervisor of a process with real safety limits.

The second finding is sharper. The four models look almost equally robust by that aggregate rate, but their failures barely overlap. Of 149 sessions, none defeated all four models, while about a third defeated at least one. Vulnerabilities are nearly disjoint across models rather than nested. Swapping to a “more robust” backbone does not inherit the previous model’s resistance; it trades one attack surface for a different one. This echoes the cross-model picture in agent-human interaction security: robustness is not a single scalar you can shop for.
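
The difference between "similar aggregate rate" and "same weaknesses" is easy to check once the same attacks have been replayed against every model. The sketch below assumes a hypothetical data layout (one set of failing session IDs per model) and uses made-up toy numbers, not figures from the paper.

```python
from collections import Counter

# Illustrative toy data only, NOT results from the paper.
# For each model: IDs of replayed attack sessions that ended with a lost CSF.
failures = {
    "model_a": {1, 2},
    "model_b": {3},
    "model_c": {2, 4},
    "model_d": {5},
}
total_sessions = 10


def overlap_summary(failures, total_sessions):
    """Summarise how failing sessions are shared across models."""
    # hits[session_id] = number of models that session defeated
    hits = Counter(sid for sessions in failures.values() for sid in sessions)
    n_models = len(failures)
    return {
        "per_model_rate": {m: len(s) / total_sessions for m, s in failures.items()},
        "beat_at_least_one": len(hits),
        "beat_all": sum(1 for count in hits.values() if count == n_models),
        # how many sessions defeated exactly k models, keyed by k
        "by_models_beaten": dict(sorted(Counter(hits.values()).items())),
    }


summary = overlap_summary(failures, total_sessions)
print(summary)
```

In the toy data every model fails 10% or 20% of sessions, yet half of all sessions defeat some model and none defeat all of them. A per-model rate alone would hide that.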

The third finding undercuts a common assumption about defences. The effect of adding a guardrail stack or a safety-advisor agent was strongly model-dependent: the same defence that lowered attack success for one model raised it for another. Defences do not compose monotonically — a result consistent with prior work showing that agent defences don’t compose cleanly.

The framing matters. This is the agentic, multi-turn version of the architectural problem OWASP put at the centre of its June 11, 2026 State of Agentic AI Security report: a model has no reliable way to separate trusted operator instructions from injected data, and when the agent is wired to a system that can lose a safety function, that confusion has physical consequences.

Defenses

NRT-Bench is a tool for finding weaknesses, so the defensive takeaways are about how you evaluate and deploy operator agents.

Score against objective state, not a judge model. If an agent supervises a system with measurable safety limits (a process variable, a rate, an interlock), make harm an objective event — “limit crossed” — attributed to the causing input. LLM-judged transcripts miss exactly the slow, multi-turn manipulations NRT-Bench was built to catch.
Red-team multi-turn, with feedback. Single-turn refusal tests over-state robustness. Adaptive sessions that see how the operator reacts and adjust are what pushed teams past the limit here. Borrow the paired-replay idea: run the same attack against each candidate model to compare like for like.
Don’t treat a “more robust” model as a drop-in upgrade. Because failures are nearly disjoint, re-run your full red-team suite on every backbone change. A model that resists your current corpus may fail a different, equally cheap attack.
Validate defences per model — they don’t compose. A guardrail or safety-advisor that helps one backbone can hurt another. Measure each defence against each model in your stack rather than assuming additive protection.
Keep humans on the irreversible actions. Where an agent can drive a system toward losing a safety function, gate the consequential steps behind human approval — the Agents Rule of Two logic applied to physical safety. Per-turn feedback to an adversary is most dangerous when the agent can act without a confirmation loop.
Reproduce before you trust. The authors release the simulation venue, attack dataset, and replay tooling. Use them as a regression suite for operator agents rather than as a one-off score.
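
The per-model validation advice above can be turned into a small regression check. This is a sketch under stated assumptions: the `outcomes` layout, the `"none"` baseline label, and the function names are hypothetical, and the booleans are toy values rather than results from the paper. It assumes each list holds outcomes for the same replayed attack sessions in the same order.

```python
# Illustrative toy outcomes only, NOT results from the paper.
# outcomes[(model, defence)] = one boolean per replayed attack session;
# True means the session ended with a lost safety function.
outcomes = {
    ("model_a", "none"):      [True, True, False, False],
    ("model_a", "guardrail"): [True, False, False, False],
    ("model_b", "none"):      [True, False, False, False],
    ("model_b", "guardrail"): [True, True, False, False],
}


def attack_success_rate(results):
    return sum(results) / len(results)


def defence_deltas(outcomes, baseline="none"):
    """Change in attack success rate per (model, defence) versus that model's baseline."""
    deltas = {}
    for (model, defence), results in outcomes.items():
        if defence == baseline:
            continue
        base = attack_success_rate(outcomes[(model, baseline)])
        deltas[(model, defence)] = attack_success_rate(results) - base
    return deltas


def regressions(deltas):
    """(model, defence) pairs where the defence made attacks succeed more often."""
    return sorted(pair for pair, delta in deltas.items() if delta > 0)


deltas = defence_deltas(outcomes)
print(deltas)
print(regressions(deltas))
```

Run on every backbone or defence change, a non-empty `regressions` list is the signal to stop and investigate rather than assume the defence carries over.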

Status

Item	Reference	Date	Notes
NRT-Bench paper	arXiv:2606.20408 (cs.CR)	2026-06-18	Multi-turn red-teaming of operator agents, CC BY 4.0
Failure rate	NRT-Bench	2026-06-18	8.7%–12.1% of attack sessions lose a CSF, across 4 models
Disjoint failures	NRT-Bench	2026-06-18	149 sessions: none beat all 4 models; ~1/3 beat ≥1
Model-dependent defences	NRT-Bench	2026-06-18	Same guardrail can lower one model’s risk and raise another’s
Architectural context	OWASP / Help Net Security	2026-06-11	Injected instructions are inseparable from data at the token level

The right framing is not “an AI can melt down a reactor”: NRT-Bench is a simulator with an objective scoreboard. The point is that putting an LLM in charge of a system with real safety limits is now measurable, and that under adaptive multi-turn pressure the limits get crossed often enough, and unpredictably enough across models, that “pick a more aligned backbone” is not a defence. If you are wiring agents to anything with an interlock, score them the way this paper does before you trust them with the interlock.