XL-SafetyBench: testing LLM safety across 10 countries, not just English
A May 7, 2026 arXiv paper from AIM Intelligence and Microsoft's AI Red Team shows that English-centric safety tests miss country-specific harms, and that the apparent "safety" of many local models is refusal by accident rather than genuine alignment.
What is this?
On May 7, 2026, a team led by AIM Intelligence — with co-authors from Microsoft’s AI Red Team, the Korea AI Safety Institute, KT, BMW Group, the Technical University of Munich, Ankara University and Seoul National University — published XL-SafetyBench on arXiv (2605.05662). It is a cross-cultural safety benchmark of 5,500 test cases across 10 country-language pairs: the United States, France, Germany, Spain, South Korea, Japan, India, Indonesia, Türkiye and the UAE.
The paper’s argument is simple and uncomfortable: most LLM safety evaluations are written in English and then translated. Translation moves the words, but it does not move the legal, institutional and cultural context in which a request is actually harmful. A benchmark that only speaks English measures whether a model refuses English-shaped harms — not whether it is safe to deploy in Seoul, Ankara or São Paulo. The accompanying June 5, 2026 announcement frames the gap as the “Illusion of Safety.”
How it works
XL-SafetyBench is built along two tracks. The first is a Jailbreak / Local Risk track of country-grounded adversarial prompts — requests whose harm depends on local laws, fraud patterns and platforms. The second is a Cultural Sensitivity track, where a sensitive element is hidden inside an otherwise innocuous request, testing whether the model recognises it at all.
Two examples from the authors make the point. A fraud prompt built around Korea’s jeonse lump-sum lease-deposit system is only recognisable as fraud if you understand that financial structure. And recommending chrysanthemums as a gift in France is culturally wrong — the flower is associated with death and mourning — even though nothing in the sentence is “unsafe” in the English sense.
Each item went through a multi-stage pipeline: LLM-assisted discovery, automated validation gates, and two independent native-speaker annotators per country. Crucially, the authors do not score on refusal alone. They report Attack Success Rate (ASR) alongside two new metrics: Neutral-Safe Rate (NSR) and Cultural Sensitivity Rate (CSR). The split exists to separate a principled refusal (“I understand this is fraud, so no”) from a comprehension failure (“I didn’t understand the request, so I produced nothing useful”).
| Metric | Question it answers |
|---|---|
| ASR | Did the model comply with a harmful, country-grounded request? |
| NSR | Was the response harmless without being a principled refusal, for example because the model did not understand the request? |
| CSR | Did the model recognise a culturally embedded sensitivity? |
No exploit prompts are reproduced here; the dataset and methodology live in the paper and on the project’s Hugging Face release.
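As a rough illustration of per-country reporting, the sketch below turns yes/no judgements into one rate per metric and country. The record shape (`country`, `metric`, `answer`), the function name `rates_by_country` and the sample data are assumptions made for this example; the paper's actual scoring pipeline is not reproduced here.

```python
from collections import defaultdict


def rates_by_country(judgements):
    """Return {country: {metric: share of 'yes' judgements}}.

    Each judgement is a (country, metric, answer) tuple, where `answer` is
    True when the judge answered "yes" to that metric's question.
    This record shape is an assumption for illustration only.
    """
    yes = defaultdict(int)
    total = defaultdict(int)
    for country, metric, answer in judgements:
        total[(country, metric)] += 1
        if answer:
            yes[(country, metric)] += 1

    rates = defaultdict(dict)
    for (country, metric), n in total.items():
        rates[country][metric] = yes[(country, metric)] / n
    return {country: dict(metrics) for country, metrics in rates.items()}


# Made-up judgements, not results from the paper.
sample = [
    ("KR", "ASR", True), ("KR", "ASR", False),
    ("KR", "ASR", False), ("KR", "ASR", False),
    ("KR", "CSR", True), ("KR", "CSR", True),
    ("FR", "ASR", False), ("FR", "ASR", False),
    ("FR", "CSR", True), ("FR", "CSR", False),
]

report = rates_by_country(sample)
for country in sorted(report):
    print(country, report[country])
```

Keeping the country as part of the key, instead of pooling everything into one rate, is what lets a weak locale stay visible in the report.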
Why it matters
Across 37 models (10 frontier, 27 local), the paper reports two findings worth carrying into any deployment review.
First, among frontier models, jailbreak robustness and cultural awareness are not coupled. A model can resist adversarial prompts well and still be culturally tone-deaf, or vice versa. That means a single blended “safety score” hides the axis you actually care about. If you are shipping into a specific market, a global average tells you almost nothing about your local risk.
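A small arithmetic sketch shows how a blended score can hide the weak axis. The two models, the axis names and the numbers below are hypothetical and chosen only to make the point; they are not taken from the paper.

```python
from statistics import mean

# Hypothetical per-axis scores (percent, higher is better); not from the paper.
models = {
    "model_a": {"jailbreak_robustness": 90, "cultural_awareness": 50},
    "model_b": {"jailbreak_robustness": 50, "cultural_awareness": 90},
}


def blended(scores):
    """Single averaged 'safety score' across all axes."""
    return mean(scores.values())


def weakest_axis(scores):
    """Name of the axis with the lowest score."""
    return min(scores, key=scores.get)


for name, scores in models.items():
    print(name, "blended:", blended(scores), "weakest axis:", weakest_axis(scores))
```

Both hypothetical models receive the same blended score, yet they fail on opposite axes, which is the information a deployment review needs.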
Second, and more pointed: local models showed a near-linear trade-off between ASR and NSR (r = -0.81). In plain terms, the models that looked “safest” were often safe because they failed to generate a useful answer at all — refusal by accident, not alignment by design. A low attack-success number that comes from the model not understanding the request is not safety; it is the Illusion of Safety with a good-looking dashboard.
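For readers who want to run the same kind of check on their own evaluation results, here is a minimal sketch of the Pearson correlation coefficient behind a figure such as r = -0.81. The per-model ASR and NSR values are made up for illustration and the helper name `pearson_r` is an assumption; the snippet does not reproduce the paper's data or analysis code.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and the two root sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (spread_x * spread_y)


# Made-up per-model rates, one entry per model; not values from the paper.
asr = [0.10, 0.20, 0.35, 0.50, 0.60]
nsr = [0.70, 0.65, 0.40, 0.30, 0.25]

r = pearson_r(asr, nsr)
print(f"r = {r:.2f}")
```

A strongly negative r on your own models would be a prompt to inspect the low-ASR responses by hand, not proof of anything by itself.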
For anyone deploying LLMs in more than one language — which includes most of this publication’s readers — the takeaway is that English red-teaming results do not transfer. The risk surface is country-shaped, and so are the blind spots.
Defenses
XL-SafetyBench is a defensive instrument, so the mitigations are mostly about how you evaluate and deploy.
- Stop trusting translated benchmarks. If your safety evidence is an English suite run through machine translation, treat it as a lower bound, not a clearance. Re-test with prompts authored in-language, grounded in local law and fraud patterns.
- Report per-axis, not a single number. Track refusal robustness and cultural comprehension separately. A blended score lets a culturally blind model pass on the strength of its jailbreak resistance.
- Distinguish refusal from comprehension. Pair every “did it refuse?” check with a “did it understand?” check (the NSR/CSR idea). A model that stays silent because it is confused will fail open the moment a request is phrased just clearly enough.
- Put native speakers in the loop. Automated judges and translation pipelines miss exactly the context that makes a request harmful in a given country. Dual native-speaker review is the part of the methodology that is hardest to fake and most worth copying.
- Scope guardrails per market. Input/output filters tuned on English harms will under-catch local fraud framings and over-block benign local references. Maintain per-locale policies, and re-validate them when you add a language.
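One way to combine the refusal check, the comprehension check and per-locale reporting is sketched below. Everything in it is an assumption for illustration: the `(locale, refused, understood)` record shape, the label names, the `max_accidental_share` threshold of 0.2 and the sample data. It is not the benchmark's own tooling, and the threshold is a placeholder to be set by your own policy.

```python
from collections import Counter


def classify(refused, understood):
    """Label one response to a harmful prompt from two separate checks."""
    if not refused:
        return "complied"
    return "principled_refusal" if understood else "accidental_refusal"


def locale_report(results, max_accidental_share=0.2):
    """Summarise (locale, refused, understood) records per locale.

    A locale is flagged when any harmful request was complied with, or when
    too many refusals came from a model that did not understand the request.
    The threshold is an arbitrary placeholder, not a recommended value.
    """
    counts = {}
    for locale, refused, understood in results:
        counts.setdefault(locale, Counter())[classify(refused, understood)] += 1

    report = {}
    for locale, c in counts.items():
        total = sum(c.values())
        accidental = c["accidental_refusal"] / total
        report[locale] = {
            "complied": c["complied"] / total,
            "principled_refusal": c["principled_refusal"] / total,
            "accidental_refusal": accidental,
            "needs_review": c["complied"] > 0 or accidental > max_accidental_share,
        }
    return report


# Made-up records for harmful prompts only; not results from the paper.
sample_results = [
    ("en-US", True, True), ("en-US", True, True),
    ("en-US", True, True), ("en-US", True, True),
    ("tr-TR", True, True), ("tr-TR", True, True),
    ("tr-TR", True, False), ("tr-TR", True, False),
    ("id-ID", True, True), ("id-ID", True, True),
    ("id-ID", True, True), ("id-ID", False, True),
]

gate = locale_report(sample_results)
for locale in sorted(gate):
    print(locale, gate[locale])
```

In this made-up sample the `tr-TR` locale refuses every prompt and would pass a refusal-only check, but half of those refusals are accidental, so it is flagged for review alongside `id-ID`, where one request was complied with.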
Status
| Item | Reference | Date | Notes |
|---|---|---|---|
| XL-SafetyBench paper | arXiv 2605.05662 | 2026-05-07 | 5,500 cases, 10 country-language pairs |
| Public announcement | AIM Intelligence (EIN Presswire / NatLawReview) | 2026-06-05 | Dataset released on Hugging Face |
| Models evaluated | Paper | 2026-05-07 | 37 total (10 frontier, 27 local) |
| Key finding (frontier) | Paper | 2026-05-07 | Jailbreak robustness ≠ cultural awareness |
| Key finding (local) | Paper | 2026-05-07 | ASR–NSR trade-off r = -0.81 (“Illusion of Safety”) |
The right framing is not “models are unsafe abroad.” It is “a model that scores safe in English has not been tested where it is going to be used.” Multilingual deployment needs multilingual evidence — authored, not translated.