CyBiasBench: offensive LLM agents keep picking the same attacks
A May 2026 benchmark logged 630 attack sessions and found that LLM agents in offensive cyber scenarios fixate on a narrow set of attack families — regardless of how you prompt them. Bias, not skill, shapes what they try.
What is this?
CyBiasBench is a benchmark released in May 2026 (arXiv 2605.07830) that asks a narrow but useful question: when you point an LLM agent at a target and tell it to attack, what does it actually try — and does that depend on the prompt, or on the agent itself?
The authors ran 630 attack sessions, pitting five agents against three targets under four prompt conditions, and watched how each agent distributed its effort across ten attack families. The headline finding is uncomfortable for anyone modelling AI-assisted attackers as flexible generalists: each agent concentrates on a narrow subset of attack families, and that subset barely moves when you change the prompt. The agents have a house style. They reach for the same techniques whether or not those techniques fit the target.
This is a measurement study, not an exploit. It tells defenders something about the behaviour of offensive agents, which is exactly the kind of finding that helps you anticipate them.
How it works
The methodology is deliberately boring, which is what makes it credible. Rather than trusting the agent’s own narration of what it did, CyBiasBench logs the raw HTTP traffic each agent generates and classifies every request with a deterministic classifier built on the OWASP Core Rule Set (CRS). Each request is bucketed into an attack family — the same taxonomy a web application firewall uses — so the measurement is reproducible and independent of the agent’s self-report.
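As a rough illustration of per-request bucketing, the sketch below maps the CRS rule IDs a request triggered to a single family label. The prefixes follow the CRS rule-file numbering (for example, 942xxx rules are SQL injection and 941xxx are XSS). The family names, the `family_for_request` helper and the tie-break rule are assumptions made for this example; the source does not describe how the paper maps CRS rules onto its ten families.

```python
from collections import Counter

# First three digits of a CRS rule ID -> illustrative family label.
# The labels are hypothetical and may not match the paper's taxonomy.
CRS_PREFIX_TO_FAMILY = {
    930: "lfi",
    931: "rfi",
    932: "rce",
    933: "php_injection",
    934: "generic_injection",
    941: "xss",
    942: "sqli",
    943: "session_fixation",
    944: "java_injection",
}


def family_for_request(matched_rule_ids):
    """Return one family label for a request, given the CRS rule IDs it triggered."""
    families = Counter()
    for rule_id in matched_rule_ids:
        family = CRS_PREFIX_TO_FAMILY.get(rule_id // 1000)
        if family is not None:
            families[family] += 1
    if not families:
        return "unclassified"
    # Deterministic tie-break: most matched rules first, then alphabetical.
    return min(families, key=lambda f: (-families[f], f))
```

The rule IDs would normally come from the WAF's audit log, and parsing that log is left out here. Because the function depends only on the IDs, the same traffic always yields the same label, which is the property the study relies on.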
With every request labelled, the team measured two things per agent: how its effort is spread across the ten families (the attack-family allocation distribution, summarised by its entropy), and how that spread responds when the prompt explicitly steers the agent toward a different family.
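A minimal sketch of that summary, assuming Shannon entropy in bits over the per-family request shares. The source does not state the paper's exact formula (log base or normalisation), and the function names here are hypothetical.

```python
import math
from collections import Counter


def allocation(labels):
    """Turn a list of per-request family labels into a share per family."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {family: n / total for family, n in counts.items()}


def entropy_bits(distribution):
    """Shannon entropy in bits; 0.0 means all effort went to one family."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)


def normalised_entropy(distribution, n_families=10):
    """Entropy scaled to [0, 1] by the maximum possible for n_families."""
    return entropy_bits(distribution) / math.log2(n_families)
```

An agent that spreads requests evenly over all ten families scores close to 1.0 on the normalised measure, while one that sends everything to a single family scores 0.0.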
Two patterns emerged:
- Explicit bias. Agents differ in their dominant attack family and in the entropy of their allocation. Some spray across families; others collapse almost entirely onto one or two. The dominant family is a property of the agent, not the scenario.
- Bias momentum. When the prompt pushes an agent toward a family that diverges from its free-choice preference, the agent resists. Steering works least well exactly where you’d most want it to — when you’re trying to pull the agent off its favourite technique.
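One way to put a number on that resistance in your own lab is to compare an agent's free-choice allocation with its allocation under a steering prompt. The two measures below are stand-ins chosen for this example, not the paper's metric, and the shares in the usage lines are made up for illustration.

```python
def steering_uptake(free_choice, steered, target_family):
    """Change in the target family's share when the prompt steers toward it."""
    return steered.get(target_family, 0.0) - free_choice.get(target_family, 0.0)


def total_variation(p, q):
    """Total variation distance between two allocations (0 = identical, 1 = disjoint)."""
    families = set(p) | set(q)
    return 0.5 * sum(abs(p.get(f, 0.0) - q.get(f, 0.0)) for f in families)


# Hypothetical shares: an agent that favours SQL injection is told to focus on LFI.
free_choice = {"sqli": 0.7, "xss": 0.2, "lfi": 0.1}
steered = {"sqli": 0.6, "xss": 0.2, "lfi": 0.2}

uptake = steering_uptake(free_choice, steered, "lfi")
shift = total_variation(free_choice, steered)
```

A small uptake and a small overall shift despite an explicit instruction is the pattern described above as momentum. A fully compliant agent would move most of its share onto the target family.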
Crucially, the paper notes that bias is better characterised as a trait of the agent than as a driver of attack success. An agent’s preferred family is not necessarily its most effective one. The fixation is behavioural, not strategic — the agent isn’t concentrating because that’s what works, it’s concentrating because that’s what it does.
Why it matters
If you build threat models for AI-assisted intrusion, the intuitive assumption is that an LLM agent explores the full attack surface — that it’s a tireless generalist trying everything. CyBiasBench says the opposite for the agents tested: they behave more like a junior operator with a few favourite moves, and they’re hard to talk out of them.
That has two consequences. For defenders, predictable attackers are good news: if a given agent reliably leans on a small set of families, the traffic it produces is more fingerprintable than a human red-teamer’s, and detection tuned to those families catches a disproportionate share of its activity. For red teams and evaluators, it’s a warning: a single off-the-shelf agent does not give you broad coverage. If your AI-assisted assessment uses one agent, you are inheriting that agent’s blind spots, and “the agent didn’t find it” tells you about the agent’s bias, not about your target’s exposure. This connects to earlier findings on how agentic red teaming compresses timelines without necessarily broadening coverage.
It also complicates benchmark design. Leaderboards that score offensive agents on a single target distribution can reward an agent whose favourite family happens to match the test, while penalising a more balanced agent — measuring fit, not capability. This is part of why meta-benchmarks like CAIBench and task suites like Cybench matter: capability has to be read across many scenarios before you can separate it from bias.
Defenses
This is research, so the “defenses” are ways to apply the finding rather than patches for a hole.
- Profile the agents, not just the attacks. If adversaries are using known agents, build detection signatures around each agent’s dominant attack families. The CRS-bucketed traffic in CyBiasBench is reproducible: you can characterise an agent’s house style in your own lab and turn it into a WAF/IDS prior.
- Don’t equate “one agent ran clean” with “we’re secure”. Coverage from a single agent is bounded by its bias. Run multiple, architecturally different agents in any AI-assisted assessment, and compare their allocation distributions to estimate the surface none of them touched.
- Treat low allocation entropy as a coverage gap, not a result. If your red-team agent spent 80% of its requests on one family, the families it ignored are unaudited, so schedule human or differently-biased follow-up there.
- Log raw traffic, classify deterministically. The study’s core method (capture HTTP, classify with OWASP CRS, ignore the agent’s self-report) is a cheap, vendor-neutral way to audit what your agents actually do versus what they claim. Self-reported attack logs are not evidence.
- Build bias into your threat models. When estimating AI-assisted attacker behaviour, model a biased operator with momentum, not an omniscient one. The realistic near-term attacker over-uses a few techniques and resists redirection, which makes their early-stage traffic noisier and more catchable than a skilled human’s.
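The coverage checks in the list above can be sketched in a few lines. The code assumes you already have a per-agent allocation (the share of requests per family). The agent names, family labels, the 5% threshold and both helper functions are hypothetical choices for this example rather than anything taken from the paper.

```python
def dominant_share(distribution):
    """Return (family, share) for the family that received the most requests."""
    family = max(distribution, key=distribution.get)
    return family, distribution[family]


def untouched_families(agent_distributions, all_families, min_share=0.05):
    """Families to which no agent gave at least `min_share` of its requests."""
    return sorted(
        family
        for family in all_families
        if all(
            dist.get(family, 0.0) < min_share
            for dist in agent_distributions.values()
        )
    )


# Hypothetical allocations from two agents run against the same target.
agents = {
    "agent_a": {"sqli": 0.8, "xss": 0.2},
    "agent_b": {"sqli": 0.5, "rce": 0.5},
}
families = ["sqli", "xss", "rce", "lfi"]

gaps = untouched_families(agents, families)
top_family, top_share = dominant_share(agents["agent_a"])
```

In this made-up case both agents ignored `lfi`, so that family would be queued for human or differently-biased follow-up. `agent_a` spending 80% of its requests on one family is the kind of result to read as a gap rather than a finding.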
Status
| Item | Reference | Date | Notes |
|---|---|---|---|
| CyBiasBench paper | arXiv 2605.07830 | 2026-05 | 630 sessions, 5 agents, 3 targets, 4 prompt conditions, 10 attack families |
| Classification method | OWASP Core Rule Set | — | Deterministic per-request attack-family labelling from raw HTTP |
| Key finding | — | — | Attack-selection bias + “bias momentum”; bias is an agent trait, not a success driver |
| Related coverage | CAIBench, Cybench | 2024–2025 | Multi-scenario benchmarks for separating capability from fit |
The useful takeaway is narrow and practical: today’s offensive LLM agents are not the all-seeing generalists threat models often assume. They have habits, those habits are measurable, and measurable habits are defensible. Profile the agent, run more than one, and watch what their traffic actually does — not what their logs say they did.