NRT-Bench: multi-turn red-teaming of LLM agents that run a plant
A June 18, 2026 benchmark puts LLM operator agents in a simulated nuclear control room. Adaptive multi-turn attacks pushed the team past a safety limit in 8.7-12.1% of sessions — and the failures barely overlap across models.
What is this?
On June 18, 2026, Hanwool Lee, Dasol Choi, Bokyeong Kim, Seung Geun Kim and Haon Park posted NRT-Bench (arXiv:2606.20408, cs.CR/cs.AI), a benchmark for multi-turn red-teaming of LLM agents acting as operators of a safety-critical system. The setting is a simulated nuclear power plant control room — chosen not because the threat is reactor sabotage, but because it gives the authors a system with hard, objective safety limits that an agent can be talked into violating.
The contribution is a measurement venue, not an attack. LLM agents are increasingly proposed as supervisory components for industrial and safety-critical control, yet their robustness under sustained, adaptive adversarial pressure is poorly characterised. Most jailbreak benchmarks score a single turn and let a judge model decide whether the output was “harmful.” NRT-Bench does neither: it runs a full operator team over many turns and defines harm as an objective physical signal.
How it works
The simulated plant is governed by six critical safety functions (CSFs). A five-role operator team, each role backed by a configurable LLM, runs the plant. Adversaries inject messages over four channels in bounded multi-turn sessions, with per-turn feedback so the attack can adapt to how the operators respond.
The harm signal is the part worth copying. Rather than asking an LLM judge to rate a transcript, a run terminates the moment any CSF is lost, and the loss is attributed to the specific message that caused it:
Operator team (5 roles, each an LLM)
│ runs the plant under 6 critical safety functions (CSFs)
▼
Adversary ──► 4 injection channels ──► multi-turn session (per-turn feedback)
│
▼
Termination: a CSF is lost ──► harm = objective event, attributed to the causing message
This is a benchmark, so there are no operational payloads to reproduce here. The interesting design choices are methodological: multi-turn (the attack persists and adapts, like the multi-turn jailbreaks studied in MultiBreak and LITMUS), team-based (five interacting roles, not a single chatbot), and objectively scored (a physical safety function either holds or it does not).
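The termination rule described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's implementation: `run_session`, `Outcome`, the `step` hook, and the toy CSF names are all hypothetical. Blaming the message of the turn in which the CSF is lost is a simplifying assumption about how attribution might work.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence


@dataclass(frozen=True)
class Outcome:
    """Result of one bounded attack session."""
    harmed: bool                  # True if any CSF was lost
    lost_csf: Optional[str]       # name of a lost CSF, if any
    causing_turn: Optional[int]   # 0-based index of the message blamed
    turns_run: int                # number of turns actually executed


def run_session(
    messages: Sequence[str],
    step: Callable[[str], Dict[str, bool]],
    max_turns: int,
) -> Outcome:
    """Deliver one injected message per turn; stop as soon as a CSF is lost.

    `step` is a hypothetical hook: it hands the message to the operator team,
    advances the simulated plant, and returns {csf_name: still_holds}.
    """
    turns_run = 0
    for turn, message in enumerate(messages[:max_turns]):
        csf_status = step(message)
        turns_run = turn + 1
        lost = sorted(name for name, holds in csf_status.items() if not holds)
        if lost:
            # Harm is an objective event; this sketch blames the current turn.
            return Outcome(True, lost[0], turn, turns_run)
    return Outcome(False, None, None, turns_run)


def make_toy_plant() -> Callable[[str], Dict[str, bool]]:
    """Toy stand-in for the simulator: each 'lower' request erodes a margin."""
    margin = {"value": 5}

    def step(message: str) -> Dict[str, bool]:
        if "lower" in message:
            margin["value"] -= 2
        # Two made-up CSF names; the benchmark tracks six.
        return {"inventory": margin["value"] > 0, "heat_removal": True}

    return step


attack = ["status?", "lower setpoint", "lower again", "lower once more"]
outcome = run_session(attack, make_toy_plant(), max_turns=10)
```

The design point is that nothing in `run_session` reads the transcript for tone or intent: the only input to the verdict is the plant state returned by `step`.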
Why it matters
Three findings stand out, and each has a defensive reading.
The authors evaluated four frontier operator models under a fixed-attack, paired-replay protocol. Across all four, between 8.7% and 12.1% of attack sessions ended with the plant losing a critical safety function. A failure rate of roughly one in ten under adaptive pressure is the headline number for anyone considering an LLM as a supervisor of a process with real safety limits.
The second finding is sharper. The four models look almost equally robust by that aggregate rate — but their failures barely overlap. Of 149 sessions, none defeated all four models, while a third defeated at least one. Vulnerabilities are nearly disjoint across models rather than nested. Swapping to a “more robust” backbone does not inherit the previous model’s resistance; it trades one attack surface for a different one. This echoes the cross-model picture in agent-human interaction security: robustness is not a single scalar you can shop for.
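One way to check for this kind of disjointness in your own paired-replay results is to count, for each session, how many models it defeated. The sketch below assumes you already have, per model, the set of session ids that ended in a lost CSF. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Dict, Set


def overlap_profile(failures: Dict[str, Set[int]]) -> Counter:
    """Map 'number of models defeated' -> 'number of sessions'.

    `failures` maps a model name to the ids of sessions that defeated it.
    Sessions that defeated no model do not appear in the result.
    """
    models_defeated = Counter()
    for sessions in failures.values():
        models_defeated.update(sessions)
    return Counter(models_defeated.values())


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Overlap of two failure sets: 0.0 is disjoint, 1.0 is identical."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy data for illustration only, not the paper's results.
toy_failures = {
    "model_a": {1, 2, 3},
    "model_b": {3, 4},
    "model_c": {5},
    "model_d": {6, 7},
}
profile = overlap_profile(toy_failures)
pairwise = jaccard(toy_failures["model_a"], toy_failures["model_b"])
```

If `profile` puts almost all of its mass on 1 and pairwise Jaccard values stay near zero, the failure sets are close to disjoint, and similar aggregate rates are hiding different attack surfaces.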
The third finding undercuts a common assumption about defences. The effect of adding a guardrail stack or a safety-advisor agent was strongly model-dependent: the same defence that lowered attack success for one model raised it for another. Defences do not compose monotonically — a result consistent with prior work showing that agent defences don’t compose cleanly.
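A per-model comparison makes this kind of non-monotonic effect visible. The sketch below assumes the same attack sessions were replayed against each model with and without the defence. `defence_deltas`, `harmful_defences`, and the toy numbers are hypothetical illustrations, not the paper's analysis code or data.

```python
from typing import Dict, List, Set


def defence_deltas(
    baseline: Dict[str, Set[int]],
    defended: Dict[str, Set[int]],
    n_sessions: int,
) -> Dict[str, float]:
    """Per-model change in attack success rate after adding a defence.

    Negative means the defence helped that model; positive means it hurt.
    Both inputs map model name -> ids of sessions that ended in a lost CSF.
    """
    return {
        model: (len(defended[model]) - len(baseline[model])) / n_sessions
        for model in baseline
    }


def harmful_defences(deltas: Dict[str, float]) -> List[str]:
    """Models for which the defence raised attack success."""
    return sorted(model for model, delta in deltas.items() if delta > 0)


# Toy data for illustration only: 10 replayed sessions, two models.
toy_baseline = {"model_a": {1, 2, 3}, "model_b": {4}}
toy_defended = {"model_a": {1}, "model_b": {4, 5, 6}}
deltas = defence_deltas(toy_baseline, toy_defended, n_sessions=10)
regressions = harmful_defences(deltas)
```

Reporting the delta per model, rather than one pooled number, is what keeps a defence that helps one backbone from masking the harm it does to another.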
The framing matters. This is the agentic, multi-turn version of the architectural problem OWASP put at the centre of its June 11, 2026 State of Agentic AI Security report: a model has no reliable way to separate trusted operator instructions from injected data, and when the agent is wired to a system that can lose a safety function, that confusion has physical consequences.
Defenses
NRT-Bench is a tool for finding weaknesses, so the defensive takeaways are about how you evaluate and deploy operator agents.
- Score against objective state, not a judge model. If an agent supervises a system with measurable safety limits (a process variable, a rate, an interlock), make harm an objective event (“limit crossed”) attributed to the causing input. LLM-judged transcripts miss exactly the slow, multi-turn manipulations NRT-Bench was built to catch.
- Red-team multi-turn, with feedback. Single-turn refusal tests over-state robustness. Adaptive sessions that see how the operator reacts and adjust are what pushed teams past the limit here. Borrow the paired-replay idea: run the same attack against each candidate model to compare like for like.
- Don’t treat a “more robust” model as a drop-in upgrade. Because failures are nearly disjoint, re-run your full red-team suite on every backbone change. A model that resists your current corpus may fail a different, equally cheap attack.
- Validate defences per model, because they don’t compose. A guardrail or safety-advisor that helps one backbone can hurt another. Measure each defence against each model in your stack rather than assuming additive protection.
- Keep humans on the irreversible actions. Where an agent can drive a system toward losing a safety function, gate the consequential steps behind human approval: the Agents Rule of Two logic applied to physical safety. Per-turn feedback to an adversary is most dangerous when the agent can act without a confirmation loop.
- Reproduce before you trust. The authors release the simulation venue, attack dataset, and replay tooling. Use them as a regression suite for operator agents rather than as a one-off score.
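The human-approval point in the list above can be illustrated with a small gate in front of the agent's actuator calls. This is a sketch under stated assumptions: `Action`, `gated_execute`, the `approve` and `execute` callbacks, and the action names are all hypothetical, and which actions count as irreversible is a decision for your own system.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class Action:
    name: str
    target: str


def gated_execute(
    action: Action,
    irreversible: FrozenSet[str],
    approve: Callable[[Action], bool],
    execute: Callable[[Action], None],
) -> bool:
    """Run `action`, but require human approval for irreversible ones.

    Returns True if the action ran and False if it was blocked.
    """
    if action.name in irreversible and not approve(action):
        return False
    execute(action)
    return True


# Usage with stand-in callbacks; a real `approve` would page an operator.
IRREVERSIBLE = frozenset({"bypass_interlock"})
executed: List[str] = []


def deny_all(action: Action) -> bool:
    return False


def record(action: Action) -> None:
    executed.append(action.name)


ran_read = gated_execute(Action("read_sensor", "loop_1"), IRREVERSIBLE, deny_all, record)
ran_bypass = gated_execute(Action("bypass_interlock", "loop_1"), IRREVERSIBLE, deny_all, record)
```

The gate sits outside the model, so an adversary who wins the conversation still has to get past a check the conversation cannot rewrite.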
Status
| Item | Reference | Date | Notes |
|---|---|---|---|
| NRT-Bench paper | arXiv:2606.20408 (cs.CR) | 2026-06-18 | Multi-turn red-teaming of operator agents, CC BY 4.0 |
| Failure rate | NRT-Bench | 2026-06-18 | 8.7%–12.1% of attack sessions lose a CSF, across 4 models |
| Disjoint failures | NRT-Bench | 2026-06-18 | 149 sessions: none beat all 4 models; ~1/3 beat ≥1 |
| Model-dependent defences | NRT-Bench | 2026-06-18 | Same guardrail can lower one model’s risk and raise another’s |
| Architectural context | OWASP / Help Net Security | 2026-06-11 | Prompt injection inseparable from data at the token level |
The right framing is not “an AI can melt down a reactor”: NRT-Bench is a simulator with an objective scoreboard. The point is that the risk of putting an LLM in charge of a system with real safety limits is now measurable, and under adaptive multi-turn pressure the limits get crossed often enough, and unpredictably enough across models, that “pick a more aligned backbone” is not a defence. If you are wiring agents to anything with an interlock, score them the way this paper does before you trust them with the interlock.