MULTIMODAL CRITICAL

AudioHijack: imperceptible audio hijacks voice agents (IEEE S&P 2026)

A paper posted to arXiv on April 16, 2026 and accepted at IEEE S&P 2026 introduces auditory prompt injection: an adversarial signal hidden inside reverb drives large audio-language models into unauthorised actions, with 79-96% average success across 13 models, and transfers to commercial voice agents from Mistral AI and Microsoft Azure.

2026-05-26 // 7 min affects: mistral-voxtral, azure-voice-agents, qwen2-audio, salmonn, gpt-4o-audio, lalm-13

What is this?

On April 16, 2026, Meng Chen and colleagues from Zhejiang University, Nanyang Technological University and the National University of Singapore posted “Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection” on arXiv (2604.14604, cs.CR). The paper has been accepted at IEEE S&P 2026 and introduces a category the authors call auditory prompt injection against Large Audio-Language Models (LALMs).

The result is uncomfortable. A short adversarial signal — trained in roughly half an hour, then convolutionally blended into ordinary reverb — embeds attacker instructions into any audio the user plays near a voice agent. The user hears a normal podcast, song, video or voice note. The model hears a control channel. On 13 state-of-the-art LALMs, average success rates run 79% to 96% across six misbehaviour categories. In a real-world study, the same signal drove commercial voice agents from Mistral AI and Microsoft Azure into web searches, file downloads and email exfiltration on the user’s behalf.

According to the authors, this is the first demonstration of auditory prompt injection that is both context-agnostic (the same signal works regardless of what the user actually said) and imperceptible (the perturbation hides inside natural reverberation).

How it works

Standard LALMs — Qwen2-Audio, SALMONN, GPT-4o-audio class systems, the Mistral and Azure voice stacks — take a continuous waveform, tokenise it through a non-differentiable audio front-end, and feed those tokens into a text LLM. Two properties of that pipeline are what AudioHijack exploits.

First, the audio channel is continuous and high-dimensional, so small perturbations have far more degrees of freedom than text. Second, the tokeniser is non-differentiable, which historically blocked end-to-end gradient attacks; the paper bypasses this with sampling-based gradient estimation.
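To make the second point concrete, here is a toy sketch of one generic family of sampling-based (zeroth-order) gradient estimators. It is not the paper's estimator, whose details are not reproduced here. The names `quantise`, `loss`, `estimate_gradient` and `descend`, and every constant, are illustrative assumptions, and the objective is a single number rather than audio. The point is only that a piecewise-constant function, whose true derivative is zero almost everywhere, still yields a usable descent direction once it is averaged over random probes.

```python
import random

def quantise(x: float, step: float = 0.25) -> float:
    """Piecewise-constant stand-in for a non-differentiable front-end (assumption)."""
    return round(x / step) * step

def loss(x: float, target: float = 2.0) -> float:
    """Toy objective; its exact derivative is zero almost everywhere."""
    return (quantise(x) - target) ** 2

def estimate_gradient(f, x, sigma=0.5, samples=2000, rng=None):
    """Antithetic Gaussian-smoothing estimate of d/dx E[f(x + sigma * u)]."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(samples):
        u = rng.gauss(0.0, 1.0)
        # Probe the function on both sides of x and weight by the direction u.
        total += (f(x + sigma * u) - f(x - sigma * u)) / (2.0 * sigma) * u
    return total / samples

def descend(f, x, steps=50, lr=0.1, seed=0):
    """Plain gradient descent driven only by the sampled estimate."""
    rng = random.Random(seed)
    for _ in range(steps):
        x -= lr * estimate_gradient(f, x, rng=rng)
    return x

gradient_at_zero = estimate_gradient(loss, 0.0)
x_final = descend(loss, 0.0)
```

In this toy setting the estimate at `x = 0` is negative, pointing towards the target, and `descend` settles near `2.0` even though `loss` is flat between quantisation steps.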

The framework has three pieces.

Attention supervision. During optimisation the perturbation is rewarded for shifting model attention onto the adversarial slice of audio and away from the user’s words. This is what gives the attack its context-agnostic property — the model “listens to” the adversarial audio regardless of what the human said.

Multi-context training. Each perturbation is trained against many random user utterances so it generalises to unseen contexts. The paper reports 79%-96% success rates on user contexts that were never seen at training time.

Convolutional blending. Raw adversarial noise is audible. AudioHijack convolves the perturbation with a natural room impulse response so it is perceived as reverberation. Listening studies in the paper confirm users do not hear it as an attack — only as ambient acoustics.

Component                Purpose                               Effect on LALM
-----------------------  ------------------------------------  -----------------------------------
Sampling-based gradient  Estimate gradient through non-        Enables end-to-end optimisation
estimation               differentiable audio tokeniser        against black-box-like pipelines
Attention supervision    Steer model attention to adversarial  Decouples attack from user content
                         audio slice                           (context-agnostic)
Multi-context training   Train on diverse user prompts         Generalises to unseen contexts
Convolutional blending   Embed perturbation into reverb        Imperceptible to listeners
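For readers unfamiliar with the signal-processing term, the sketch below shows what convolving a signal with a room impulse response means in general. It is ordinary convolution reverb on made-up numbers, with no adversarial signal and none of the paper's optimisation. The function name `convolve` and the three-tap `room` response are assumptions chosen for illustration.

```python
def convolve(signal, impulse_response):
    """Full discrete convolution: each input sample excites the whole response."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

dry = [1.0, 2.0, 3.0]    # hypothetical "dry" samples
room = [1.0, 0.0, 0.5]   # direct path plus one echo at half amplitude, two samples later
wet = convolve(dry, room)
print(wet)               # [1.0, 2.0, 3.5, 1.0, 1.5]
```

The output is the dry signal plus a delayed, attenuated copy of itself. Real room responses have thousands of taps, and listeners are accustomed to hearing exactly this kind of structure in everyday recordings.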

The misbehaviour set the paper measures includes six categories — refusing legitimate tasks, leaking system instructions, fabricating tool calls, performing unauthorised tool calls, generating disallowed content, and silently substituting the user’s intent. Real-world tool-call demonstrations cover downloading attacker-controlled files, sending emails containing user data, and steering web searches — all triggered while the user is asking the agent about something else entirely.

No payload is reproduced here. The arXiv paper, the authors’ code release on GitHub and the IEEE S&P 2026 publication are the canonical references for researchers who want to reproduce the result in a lab.

Why it matters

Three properties make this class harder than text-domain prompt injection.

First, the trust model is broken at the modality boundary. A voice agent already accepts audio from the environment as a primary input. There is no equivalent of “untrusted document” framing for a sound the user voluntarily played. The user’s microphone is doing exactly what it was designed to do.

Second, the attack transfers to commercial systems. The paper’s real-world section is the part defenders should read first: adversarial audio generated locally transferred to Microsoft Azure and Mistral AI voice agents and induced them to perform sensitive actions through single or cascaded tool calls. This is not a lab-only result; it crosses the gap to production-grade voice stacks.

Third, defences shipped today are weak. The authors evaluated two natural mitigations and report blunt numbers: prompt-level “watch out for suspicious instructions” hardening cut the attack success rate (ASR) by only 7 percentage points, and intent verification (the model checks whether its response matches what the user asked for) caught just 28% of attacks. Neither approach is anywhere near a fix.
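A quick back-of-the-envelope calculation shows how little those numbers buy. The sketch assumes, purely for illustration, that the 7-point reduction applies uniformly at both ends of the reported 79-96% range; the source does not break the figure down per model. The function names are hypothetical.

```python
def asr_after_prompt_hardening(baseline_asr_pct, reduction_pp=7):
    """Residual attack success rate after subtracting the reported reduction."""
    return max(0, baseline_asr_pct - reduction_pp)

def attacks_missed_pct(detection_rate_pct=28):
    """Share of attacks that intent verification does not catch."""
    return 100 - detection_rate_pct

for baseline in (79, 96):
    # Assumption: the same 7 pp reduction at each end of the reported range.
    print(baseline, "->", asr_after_prompt_hardening(baseline))

print("missed by intent verification:", attacks_missed_pct(), "%")
```

Under that assumption the attack still succeeds 72% to 89% of the time against a hardened prompt, and 72% of attacks pass intent verification undetected.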

The broader pattern matters for anyone deploying multimodal agents. Each new input modality — audio, image, video, sensor — is a new injection channel that text-only defences will not cover. AudioHijack is the audio case study; the structural lesson is wider.

Defences

No single mitigation retires this class as of late May 2026. The shortest defensible list, drawn from the paper itself and from standard multimodal-security practice:

Authenticate the input channel, not just the content. Voice agents should distinguish audio the user directly spoke into the microphone from audio played through a speaker in the environment. Hardware presence signals (near-field vs far-field, second microphone array, vibration) can give the agent a notion of origin that text-only pipelines never had.
Treat ambient audio as untrusted by default. When an audio segment cannot be confidently attributed to the active speaker, downgrade its authority: do not allow tool calls or memory writes derived from it without a confirmation step.
Adversarial training and certified defences. The paper notes that ad-hoc prompt hardening is ineffective. Adversarial training on AudioHijack-style perturbations, randomised input transformations (resampling, noise injection, MP3 round-tripping) and certified-robustness techniques are the directions worth funding, with the explicit caveat that none of these is a solved problem.
Restrict tool surface for voice agents. A voice agent that cannot send mail, cannot download arbitrary files and cannot browse to arbitrary URLs cannot be made to do those things from a hijacked prompt. Apply the Agents Rule of Two — at most two of “untrusted input / sensitive tools / exfiltration channel” at a time.
Require explicit confirmation for high-stakes actions. Sending email, downloading files, transferring money, changing settings: a brief spoken or on-screen confirmation breaks the silent attack path even when the prompt injection succeeds in the model.
Log and replay audio context for high-authority actions. When a voice agent performs a sensitive action, the audio that preceded it should be retained and reviewable so post-hoc forensics can recognise an AudioHijack-style overlay.
Watch for the cross-modal pattern, not just audio. The same structural problem — a non-text modality with a continuous, high-dimensional input space and a non-differentiable front-end — applies to vision, sensor and video LLMs. Defences should be built modality-agnostic.
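Several of the items above (channel origin, downgraded authority for ambient audio, confirmation for high-stakes actions, and retained audio context) can be combined into one policy gate in front of the tool layer. The following is a minimal sketch under stated assumptions, not a reference design: the tool names, the `Origin` labels, the 0.9 confidence threshold and the `audio_ref` field are all hypothetical. Denying sensitive tools outright for ambient audio is a stricter choice than the confirmation step the list asks for.

```python
from dataclasses import dataclass
from enum import Enum

class Origin(Enum):
    NEAR_FIELD_SPEAKER = "near_field_speaker"  # attributed to the active user
    AMBIENT = "ambient"                        # played nearby or unattributed

class Decision(Enum):
    ALLOW = "allow"
    CONFIRM = "confirm"  # ask the user, by voice or on screen
    DENY = "deny"

# Hypothetical tool names standing in for a real agent's high-stakes actions.
SENSITIVE_TOOLS = frozenset(
    {"send_email", "download_file", "transfer_money", "change_settings"}
)

@dataclass(frozen=True)
class ToolRequest:
    tool: str
    origin: Origin
    origin_confidence: float  # 0.0 to 1.0, from a hypothetical origin classifier
    audio_ref: str            # pointer to the retained audio segment

def decide(request: ToolRequest, min_confidence: float = 0.9) -> Decision:
    """Map (origin, tool sensitivity) to allow / confirm / deny."""
    trusted = (
        request.origin is Origin.NEAR_FIELD_SPEAKER
        and request.origin_confidence >= min_confidence
    )
    sensitive = request.tool in SENSITIVE_TOOLS
    if trusted:
        return Decision.CONFIRM if sensitive else Decision.ALLOW
    # Unattributed audio never acts silently, and never reaches sensitive tools.
    return Decision.DENY if sensitive else Decision.CONFIRM

def decide_and_log(request: ToolRequest, audit_log: list) -> Decision:
    """Record every decision with its audio reference for later review."""
    decision = decide(request)
    audit_log.append(
        {"tool": request.tool, "origin": request.origin.value,
         "decision": decision.value, "audio_ref": request.audio_ref}
    )
    return decision

log = []
hijacked = ToolRequest("send_email", Origin.AMBIENT, 0.95, "segment-0001")
print(decide_and_log(hijacked, log))  # Decision.DENY
```

The gate deliberately sits outside the model: it still holds when the injection has already succeeded inside the LALM, which is the case the paper's results suggest defenders should plan for.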

Status

Item	Reference	Date	Notes
Paper	arXiv:2604.14604 v1	2026-04-16	Accepted at IEEE S&P 2026
Code	github.com/zju-muslab/AudioHijack	2026-04	Reference implementation
Affected LALMs	13 state-of-the-art	—	79%-96% average ASR on unseen contexts
Affected commercial agents	Mistral AI voice agent; Microsoft Azure voice agents	2026-04	Real-world tool-call hijacking demonstrated
Defences tried	Prompt hardening; intent verification	2026-04	-7 pp ASR; 28% detection — insufficient
Category	Multimodal prompt injection	—	New attack class proposed by authors

Audio used to be the modality where prompt injection was studied at the jailbreak level — get the model to say something it would refuse if asked in text. AudioHijack is one step further: get the agent to do something on the user’s behalf, while the human in the room hears only ordinary reverberation. The April 2026 paper does not retire any defence; it does retire the assumption that voice was the safer channel.