MULTIMODAL MEDIUM NEW

Sirens' Whisper: inaudible near-ultrasonic jailbreaks of voice LLMs

A March 14, 2026 paper from Huazhong, Tsinghua and Microsoft hides jailbreak prompts in the 17–22 kHz band. Microphone nonlinearity demodulates them back into commands — silent to humans, up to 0.94 non-refusal on commercial voice LLMs.

2026-06-18 // 7 min // affects: deepseek, glm-4-air, grok-4, glm-4-voice, qwen-omni-turbo

What is this?

On March 14, 2026, researchers from Huazhong University of Science and Technology, Tsinghua University and Microsoft published Sirens’ Whisper (SWhisper), a framework that delivers jailbreak prompts to speech-driven LLMs through a channel humans cannot hear. The prompt is encoded into the 17–22 kHz near-ultrasonic band, played through an ordinary speaker, and then demodulated back into an audible command by the nonlinearity of the victim’s microphone. To a person in the room it sounds like silence — a controlled user study found the injected audio “perceptually indistinguishable from background-only playback.” To the model, it is a spoken instruction.

This is the acoustic-side-channel idea behind DolphinAttack and NUIT, carried forward to the era of voice assistants backed by large language models. The contribution is not “ultrasound can reach a microphone” — that is well known — but that a structured, multi-sentence jailbreak prompt can survive the trip and steer a black-box commercial voice LLM. We cover it because voice is becoming a default interface (Apple, Google and Amazon all ship speech-driven assistants), and an inaudible prompt-injection channel changes the threat model for every one of them.

How it works

A microphone is not a perfectly linear device. Its response includes higher-order terms, modelled in the paper as S_out = k1·S_in + k2·S_in² + k3·S_in³ + …, where k1, k2, k3 are the device’s gain coefficients. The quadratic term k2·S_in² is what mixes a high-frequency carrier down into the audible baseband: squaring a carrier together with its sideband yields a component at their difference frequency, which is the original audio. SWhisper exploits exactly this: it modulates the target audio onto a near-ultrasonic carrier via single-sideband modulation, and the microphone’s own hardware does the “decoding.”
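The mixing effect of the quadratic term can be checked numerically with a harmless test tone. The sketch below is not the paper's pipeline: it assumes plain amplitude modulation of a 1 kHz tone rather than single-sideband speech, a 48 kHz sample rate, and illustrative coefficients k1 = 1.0 and k2 = 0.1. It measures how much 1 kHz energy exists before and after a simulated nonlinear microphone.

```python
import math

SAMPLE_RATE = 48_000   # Hz, assumed capture rate
CARRIER_HZ = 20_000    # inside the 17-22 kHz band discussed above
TONE_HZ = 1_000        # harmless stand-in for speech content
NUM_SAMPLES = 4_800    # 0.1 s, so every tone sits on an exact 10 Hz bin


def goertzel_amplitude(samples, sample_rate, freq):
    """Amplitude of the sinusoidal component at `freq` (Goertzel algorithm)."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / sample_rate)
    s1 = s2 = 0.0
    for x in samples:
        s1, s2 = x + coeff * s1 - s2, s1
    power = s1 * s1 + s2 * s2 - coeff * s1 * s2
    return 2.0 * math.sqrt(max(power, 0.0)) / len(samples)


def am_tone(depth=0.5):
    """Carrier at CARRIER_HZ whose envelope carries a TONE_HZ tone."""
    out = []
    for i in range(NUM_SAMPLES):
        t = i / SAMPLE_RATE
        envelope = 1.0 + depth * math.cos(2.0 * math.pi * TONE_HZ * t)
        out.append(envelope * math.cos(2.0 * math.pi * CARRIER_HZ * t))
    return out


def microphone(signal, k1=1.0, k2=0.1):
    """Toy microphone: linear term plus the quadratic term k2 * S_in^2."""
    return [k1 * x + k2 * x * x for x in signal]


transmitted = am_tone()
captured = microphone(transmitted)

before = goertzel_amplitude(transmitted, SAMPLE_RATE, TONE_HZ)
after = goertzel_amplitude(captured, SAMPLE_RATE, TONE_HZ)

print(f"1 kHz amplitude in the air:           {before:.4f}")  # 0.0000
print(f"1 kHz amplitude after the microphone: {after:.4f}")   # 0.0500
```

In the air the signal has energy only at 19, 20 and 21 kHz. After the quadratic term, a 1 kHz component of amplitude k2 × depth appears, which is the baseband audio a downstream recogniser would hear.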

The hard part is fidelity. Near-ultrasound suffers heavy air absorption and irregular hardware response above 17 kHz, so a naive carrier arrives as garbage. The paper’s core move is channel-inversion pre-compensation: it models the compound microphone-and-channel transfer function, then pre-distorts the waveform so that what lands in the baseband matches the intended prompt across different devices and rooms.

Attacker speaker                Victim microphone            Speech LLM
----------------                -----------------            ----------
prompt → SSB-modulate    →      nonlinear demod        →     "transcribes"
into 17–22 kHz, with            (k2·S_in² term)              the recovered
channel-inversion               recovers baseband           prompt as a
pre-compensation                audio in the clear          spoken command

No payloads are reproduced here. The threat model is the operative detail. The victim model is treated as a black box (audio in, audio out); the attacker optimises against a white-box surrogate and relies on transfer. The attack must land in a single query, uses commodity speakers (no specialised ultrasonic rig), and was demonstrated at ~1 m, 0° orientation, in 36–38 dB ambient noise. Reported effectiveness reaches up to 0.94 non-refusal and 0.925 specific-convincing on commercial models, evaluated with the StrongREJECT methodology over an AdvBench prompt subset. Tested targets included DeepSeek (Non-Thinking mode), GLM-4-Air and Grok-4 as voice-enabled LLMs, plus end-to-end audio models GLM-4-Voice and Qwen-Omni-Turbo.

Why it matters

Text guardrails never see this attack. Input filtering, moderation prompts and instruction-hierarchy training all operate on the transcript — but the malicious instruction is injected below the application, in the analogue gap between an ordinary loudspeaker and the microphone. By the time the audio becomes text, it already looks like a legitimate user utterance.

The constraints are real and worth stating plainly: the attack needs a speaker within roughly a metre, is sensitive to angle and distance, and as a jailbreak it mostly produces disallowed content rather than privileged actions. But two trends raise the stakes. Voice is moving from “ask a question” to “do a thing” — agents that send messages, control devices or trigger tool calls. And the authors note the same covert channel “enables a broader class of high-fidelity prompt-injection and command-execution attacks,” not just jailbreaks. An inaudible instruction that reaches an agent with real tools is the part defenders should plan for now.

Defences

The injection happens at the signal layer, so the defence has to start there and continue up the stack. The paper itself discusses both signal-based and text-based countermeasures; the durable principles are well established in the acoustic-injection literature.

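Low-pass / anti-aliasing filtering early in the microphone path. Band-limit the microphone path so energy above the speech band (roughly above 8 kHz) is attenuated before it reaches the nonlinear stage. Placement matters: the demodulation happens in the analogue front end, so a filter applied after capture, for example digitally just before automatic speech recognition, strips leftover carrier but not the baseband command the microphone has already produced. Filtering attacks the carrier the demodulation depends on only when it sits ahead of the nonlinearity.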
Detect near-ultrasonic energy. Monitor the 17–22 kHz band for the structured, sustained signals these attacks require. Persistent high-frequency content during a “spoken” command is an anomaly worth flagging or rejecting.
Harden the microphone front end. Hardware and firmware that suppress nonlinear demodulation (improved analogue design, ultrasonic guards) remove the physical primitive. This is the most complete fix and the slowest to deploy.
Gate actions, not just words. Treat any voice-initiated, high-impact action — sending data, messaging, purchases, device or tool control — as requiring explicit, out-of-band confirmation. A jailbroken transcript should not be sufficient to act.
Add liveness and provenance checks. Speaker verification, challenge-response, and rejecting commands that lack normal conversational context raise the cost of a single-shot inaudible injection.
Threat-model the analogue gap. Security reviews of voice agents should explicitly include physical-acoustic channels, not only the text interface. Assume the microphone can be addressed by signals the user never hears.
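As one illustration of the "detect near-ultrasonic energy" item, the sketch below estimates what fraction of a captured frame's energy sits in the 17–22 kHz band. It is a minimal example under stated assumptions, not a tested detector: the 48 kHz capture rate, the 20 ms frame length and the 1% alert threshold are all hypothetical values chosen for illustration, and the check only works if the capture path still carries that band (no earlier filter or downsampling has removed it).

```python
import math

SAMPLE_RATE = 48_000     # Hz; assumed capture rate, must exceed 2 x 22 kHz
FRAME_LEN = 960          # 20 ms frames, giving 50 Hz bin spacing
ALERT_FRACTION = 0.01    # hypothetical threshold; tune on real recordings


def goertzel_power(frame, omega):
    """Squared DFT magnitude of `frame` at angular frequency `omega` (rad/sample)."""
    coeff = 2.0 * math.cos(omega)
    s1 = s2 = 0.0
    for x in frame:
        s1, s2 = x + coeff * s1 - s2, s1
    return s1 * s1 + s2 * s2 - coeff * s1 * s2


def band_energy_fraction(frame, sample_rate, lo_hz=17_000, hi_hz=22_000):
    """Share of the frame's total energy that lies between lo_hz and hi_hz."""
    n = len(frame)
    total = sum(x * x for x in frame)
    if total == 0.0:
        return 0.0
    k_lo = math.ceil(lo_hz * n / sample_rate)
    k_hi = math.floor(hi_hz * n / sample_rate)
    band = sum(
        goertzel_power(frame, 2.0 * math.pi * k / n) for k in range(k_lo, k_hi + 1)
    )
    # Parseval: each positive-frequency bin of a real signal counts twice.
    return min(1.0, 2.0 * band / (n * total))


def is_suspicious(frame, sample_rate=SAMPLE_RATE):
    """Flag frames whose near-ultrasonic share exceeds the alert threshold."""
    return band_energy_fraction(frame, sample_rate) > ALERT_FRACTION


def tone_mix(components):
    """Synthetic test frame: sum of cosines given as (frequency_hz, amplitude)."""
    return [
        sum(a * math.cos(2.0 * math.pi * f * i / SAMPLE_RATE) for f, a in components)
        for i in range(FRAME_LEN)
    ]


speech_like = tone_mix([(300, 1.0), (1_200, 0.5)])
with_carrier = tone_mix([(300, 1.0), (1_200, 0.5), (20_000, 0.3)])

print(is_suspicious(speech_like), is_suspicious(with_carrier))  # False True
```

A real deployment would need to look at sustained behaviour across many frames rather than a single one, since ordinary environments also contain brief high-frequency content.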

Status

Item	Reference	Date	Notes
SWhisper paper (arXiv:2603.13847v1)	Huazhong U. / Tsinghua / Microsoft	2026-03-14	First framework for covert near-ultrasonic prompt injection into black-box speech LLMs
Carrier band	Paper	2026-03-14	17–22 kHz, single-sideband modulation, channel-inversion pre-compensation
Reported effectiveness	Paper	2026-03-14	Up to 0.94 non-refusal / 0.925 specific-convincing on commercial models
Human perceptibility	User study	2026-03-14	Injected audio indistinguishable from background-only playback
Targets evaluated	Paper	2026-03-14	DeepSeek, GLM-4-Air, Grok-4; end-to-end audio models (LALMs) GLM-4-Voice, Qwen-Omni-Turbo

The takeaway is not that any one voice model is broken — it is that the microphone is part of your attack surface. As speech-driven LLMs gain the ability to act, the analogue channel between a speaker and a mic becomes an injection path that no amount of text-level alignment can close. The defences that matter are signal filtering, hardware hardening, and refusing to let a transcript alone authorise consequential actions.
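The last of those defences, refusing to act on a transcript alone, can be sketched in a few lines. Everything named here is hypothetical: the action names, the `VoiceIntent` type and the `confirm_out_of_band` callback stand in for whatever intent schema and confirmation channel (a phone prompt, a button press) a given voice agent actually has.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical set of actions that a transcript alone must never authorise.
HIGH_IMPACT_ACTIONS = frozenset(
    {"send_message", "make_purchase", "control_device", "share_data"}
)


@dataclass(frozen=True)
class VoiceIntent:
    action: str       # what the agent wants to do
    transcript: str   # the utterance that triggered it


def authorise(
    intent: VoiceIntent,
    confirm_out_of_band: Callable[[VoiceIntent], bool],
) -> bool:
    """Allow low-impact intents; require a second channel for high-impact ones."""
    if intent.action not in HIGH_IMPACT_ACTIONS:
        return True
    # The confirmation must not travel over the microphone path itself.
    return confirm_out_of_band(intent)


def nobody_confirmed(intent: VoiceIntent) -> bool:
    """Stand-in callback for a user who never saw or approved the request."""
    return False


print(authorise(VoiceIntent("get_weather", "what's the weather"), nobody_confirmed))
print(authorise(VoiceIntent("make_purchase", "order it now"), nobody_confirmed))
```

The design choice is that the default for a high-impact action is denial: an injected utterance can produce a transcript, but it cannot produce the confirmation.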