ADVERSARIAL MEDIUM NEW

HPAA: typography humans read but moderation LLMs miss

A June 8, 2026 paper introduces Human-Perceptible Adversarial Attacks — harmful text that stays obvious to a human reader but slips past LLM content moderation through typographic manipulation.

2026-06-11 // 5 min affects: llm-content-moderation, text-moderation-pipelines, multimodal-llm-moderation

What is this?

On June 8, 2026, researchers posted “What the Eyes See, the LLMs Miss: Exploiting Human Perception for Adversarial Text Attacks” (arXiv 2606.09700). It names a class of attack the authors call Human-Perceptible Adversarial Attacks (HPAA): harmful text that a human reader recognizes instantly, but that an LLM-based content moderation system fails to flag.

The mechanism is not obfuscation in the usual sense. The harmful words are still there, still readable on screen. The attack exploits a perceptual mismatch: humans interpret a block of text using visual cues — spacing, emphasis, spatial arrangement — while a moderation model consumes the same content as a token stream that discards most of that visual structure. Content that is “readable as harmful” to a person can therefore be “effectively invisible” to the classifier reading it.

How it works

A moderation LLM does not see pixels. It sees tokens. Typography that a human brain reassembles into a clear word can be split by the tokenizer into fragments that no longer match the harmful term the safety model was trained to catch.
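The fragmentation can be illustrated with a toy sketch. Everything here is an assumption for illustration: `toy_tokenize` is a crude splitter rather than a real subword tokenizer, `BLOCKED_TERMS` holds a benign stand-in word, and the paper's actual transformations are not reproduced. Real tokenizers fragment text differently, but the mismatch has the same shape.

```python
import re

# Benign stand-in for a term a moderation filter is expected to catch.
BLOCKED_TERMS = {"placeholder"}


def toy_tokenize(text: str) -> list[str]:
    """Split on anything that is not a lowercase letter or digit.

    This is a crude stand-in for a real tokenizer, not a model of one.
    """
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


def token_level_flag(text: str) -> bool:
    """Flag the text only if a whole token equals a blocked term."""
    return any(tok in BLOCKED_TERMS for tok in toy_tokenize(text))


plain = "this is a placeholder message"
spaced = "this is a p l a c e h o l d e r message"

print(toy_tokenize(spaced))      # the term arrives as single-letter fragments
print(token_level_flag(plain))   # True
print(token_level_flag(spaced))  # False
```

A reader sees the same word in both strings; the token-level check only sees it in the first.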

HPAA leans on three families of typographic manipulation, applied so that the visual reading is preserved while the tokenized reading is fragmented:

Lever                 Human reads it as…        Tokenizer sees…
--------------------  ------------------------  ----------------------------
Spacing               one coherent word         several short, benign chunks
Visual emphasis       a single emphasized term  decorative characters + stubs
Spatial arrangement   a phrase laid out in 2-D  a scrambled left-to-right run

No working payloads are reproduced here. The harmful surface string is represented as [REDACTED] — what matters for defenders is the shape of the bypass, not a copy-paste recipe. The point the paper makes is structural: the moderation model and the human are reading two different documents that happen to share the same pixels.

This sits next to, but is distinct from, image-channel evasion. Multimodal “smuggling” attacks such as “Making MLLMs Blind” hide harmful content inside rendered images; HPAA stays in the text channel and weaponizes the gap between rendered glyphs and tokens.

Why it matters

Content moderation is one of the most widely deployed safety uses of LLMs — comment filtering, marketplace listings, chat safety, abuse triage, ad review. Most of these pipelines assume that if a model can read the text, it can judge the text.

HPAA breaks that assumption in the worst direction. A false negative here is not a curiosity: it is harmful content reaching a human audience while the dashboard reports “clean.” Because the attack preserves human readability by design, it is purpose-built for content meant to be seen (harassment, hate speech, scams) rather than for slipping instructions past an agent. The authors’ lab summarizes the asymmetry bluntly: people see text, but the LLM does not.

The uncomfortable corollary is that scaling the moderation model up does not obviously close the gap, because the gap lives in the tokenization and input representation, not in the model’s reasoning. A smarter classifier still reads the fragmented token stream.

Defenses

The fix is to stop pretending the token stream is the document a human sees, and to converge the two views before judging.

  1. Normalize before you classify. Run input through Unicode normalization, whitespace collapsing, homoglyph folding, and zero-width-character stripping before the moderation model. Much of HPAA’s spacing and emphasis trickery collapses under aggressive canonicalization.

  2. Render-and-read. Render the text the way a user will see it, then judge it through the visual channel (OCR or a vision model) and compare that verdict to the text-only verdict. A divergence between “what it renders as” and “what it tokenizes as” is itself a strong abuse signal. This is the same instinct behind defenses like “Eyes Closed, Safety On,” applied to moderation rather than jailbreak defense.

  3. Flag structural anomalies. Unusual intra-word spacing, decorative character runs, and 2-D layout in a field that should be plain prose are cheap to detect heuristically and rare in benign content. Treat them as “review,” not “pass.”

  4. Defense in depth. Keep deterministic keyword/regex layers (operating on the normalized form) alongside the LLM. They are dumb, but they are not fooled by the same things the model is fooled by.

  5. Test with perceptual adversaries. Add HPAA-style transformations to your red-team corpus and measure the false-negative rate on visually obvious harm, not just clean text. If your evaluation only uses untransformed strings, it is blind to exactly this failure.
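Items 1, 3, and 4 can be sketched together as a small pre-classification layer. This is a minimal sketch under stated assumptions, not the paper's method: the function names, the tiny `HOMOGLYPHS` map, the separator set, and the thresholds are all hypothetical, and a production system would use fuller confusables data and tuned heuristics.

```python
import re
import unicodedata

# Zero-width / invisible characters to strip (illustrative, not exhaustive).
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))

# Hypothetical, deliberately tiny homoglyph map (Cyrillic look-alikes to Latin).
HOMOGLYPHS = str.maketrans({"\u0430": "a", "\u0435": "e", "\u043e": "o",
                            "\u0440": "p", "\u0441": "c"})

_LETTER = r"[^\W\d_]"      # any Unicode letter
_SEP = r"[ \t.*\-]{1,3}"   # assumed separator characters between letters
# Four or more single letters, each separated by a short separator run.
SPACED_RUN = re.compile(r"(?<!\w)" + _LETTER + "(?:" + _SEP + _LETTER + r"){3,}(?!\w)")
DECORATIVE_RUN = re.compile(r"[^\w\s]{4,}")


def normalize(text: str) -> str:
    """Canonicalize text before it reaches the classifier or keyword layer."""
    text = unicodedata.normalize("NFKC", text)      # fold compatibility forms
    text = text.translate(ZERO_WIDTH)               # drop zero-width characters
    text = text.casefold().translate(HOMOGLYPHS)    # fold case, then look-alikes
    # Re-join letters that were spaced out one by one.
    text = SPACED_RUN.sub(lambda m: re.sub(r"\W+", "", m.group()), text)
    return re.sub(r"\s+", " ", text).strip()        # collapse whitespace


def structural_flags(text: str) -> list[str]:
    """Cheap heuristics on the raw input; any flag means 'review', not 'pass'."""
    flags = []
    if any(ord(ch) in ZERO_WIDTH for ch in text):
        flags.append("zero_width")
    if SPACED_RUN.search(text):
        flags.append("spaced_letters")
    if DECORATIVE_RUN.search(text):
        flags.append("decorative_run")
    return flags


def deterministic_hit(text: str, blocked_terms: set[str]) -> bool:
    """Keyword layer that matches substrings of the normalized form."""
    normalized = normalize(text)
    return any(term in normalized for term in blocked_terms)


print(normalize("P L A C E H O L D E R  now"))     # placeholder now
print(structural_flags("p l a c e"))               # ['spaced_letters']
```

The keyword layer matches substrings rather than whole words because re-joining spaced letters can merge a neighbouring single-letter word into the term. That trades some false positives for fewer misses, which suits a layer whose output is "review".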

Status

Item                     Reference                                 Date        Notes
-----------------------  ----------------------------------------  ----------  --------------------------------------------------------------------
HPAA paper               arXiv 2606.09700                          2026-06-08  Introduces Human-Perceptible Adversarial Attacks on text moderation
Lab write-up             CSU-JPG Lab                               2026        “People see text, but LLM not”
Related (image channel)  Making MLLMs Blind, arXiv 2604.06950      2026-04     Smuggling via rendered images, distinct channel
Defense pattern          Eyes Closed, Safety On, arXiv 2403.09572  2024-03     Image-to-text transformation as a safety layer

The takeaway is not “moderation LLMs are useless.” It is narrower and more actionable: a moderation system that judges only the token stream is judging a different document than the one your users read. Close that gap — normalize, render, compare — before an attacker does it for you.
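The "compare" step can be sketched as a small decision function. This is a hedged illustration only: `dual_view_decision`, the `Judge` type, and both toy judges are hypothetical names. In a real pipeline `judge_tokens` would be the text-only moderation model and `judge_rendered` would be an OCR or vision-model pass over the rendered text; the toy rendered judge below merely imitates a reader by ignoring separators.

```python
import re
from typing import Callable

# A judge returns True when the text should be flagged as harmful.
Judge = Callable[[str], bool]


def dual_view_decision(text: str, judge_tokens: Judge, judge_rendered: Judge) -> str:
    """Combine the token-view and rendered-view verdicts.

    Disagreement between the two views is treated as an abuse signal
    and routed to review rather than passed.
    """
    token_flag = judge_tokens(text)
    rendered_flag = judge_rendered(text)
    if token_flag != rendered_flag:
        return "review"
    return "block" if token_flag else "pass"


# Toy stand-ins so the sketch runs; neither is a real moderation model.
BLOCKED = {"placeholder"}  # benign stand-in term


def toy_token_judge(text: str) -> bool:
    """Flags only when a whole token equals a blocked term."""
    return any(tok in BLOCKED for tok in re.findall(r"[a-z]+", text.lower()))


def toy_rendered_judge(text: str) -> bool:
    """Imitates a reader by dropping everything except letters."""
    squeezed = re.sub(r"[^a-z]", "", text.lower())
    return any(term in squeezed for term in BLOCKED)


for sample in ("hello there", "a placeholder message", "p l a c e h o l d e r"):
    print(sample, "->", dual_view_decision(sample, toy_token_judge, toy_rendered_judge))
```

Routing disagreement to "review" instead of "block" keeps a noisy rendered-view judge from silently removing benign content, while still surfacing the cases where the two views read different documents.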

Sources