DATA LEAK MEDIUM NEW

Side channels on LLM inference: your prompts leak despite TLS

Speculative decoding and streaming responses create traffic patterns that leak prompt topics, languages, even PII — through encrypted connections. A look at three papers and the defenses.

2026-06-17 // 6 min // affects: chatgpt, claude, vllm, open-weight-llms

What is this?

Side-channel attacks don’t read the contents of your conversation with an LLM — they read its shape. The size and timing of the encrypted packets streaming back from a model carry enough structure to infer what you asked about, even though TLS hides every byte of the actual text. On February 17, 2026, Bruce Schneier grouped three papers that make this concrete, and together they describe a privacy-leakage class that is independent of prompt injection or jailbreaks and that affects production services from major providers.

The throughline: the optimizations that make modern LLM serving fast (streaming token-by-token output, speculative decoding, parallel decoding) are data-dependent. How fast tokens arrive and how many ship per network flush both depend on what the model is generating. That dependence is a measurable signal. We are covering it because it is a structural privacy risk that no amount of input filtering or output moderation addresses, and because defenders rarely think of “the network trace of a chat session” as sensitive.

How it works

Three published results map the surface. None of them require breaking encryption.

Remote Timing Attacks on Efficient Language Model Inference (arXiv 2410.17175, posted October 2024) shows that techniques like speculative sampling and parallel decoding introduce data-dependent timing characteristics. By passively monitoring encrypted traffic between a user and a remote model, an observer learns when responses run faster or slower. On open-source systems the authors recover a conversation’s topic — for example, medical advice versus coding help — with 90%+ precision; against production ChatGPT and Claude they distinguish specific messages or infer the user’s language; and an active adversary using a boosting technique can recover PII such as phone numbers or card numbers from open-source deployments.

When Speculation Spills Secrets (arXiv 2411.01076, posted November 2024) isolates speculative decoding specifically. Because the scheme verifies several candidate tokens in parallel, the per-iteration count of accepted versus rejected tokens is input-dependent and visible as packet sizes. Tested on research prototypes and production-grade vLLM, an observer fingerprints user queries from a set of 50 prompts with over 75% accuracy at temperature 0.3 — REST 100%, LADE 91.6%, BiLD 95.2%, EAGLE 77.6% — staying far above the 2% random baseline even at temperature 1.0. The same channel leaks confidential datastore contents used for prediction at rates exceeding 25 tokens/sec.
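To make the fingerprinting idea concrete, here is a toy sketch. It is not the paper's attack and involves no real model: `accepted_trace` is a hypothetical stand-in that fakes a prompt-dependent accept/reject pattern with a hash-seeded random generator, and `identify` is a plain nearest-neighbor match. The only point is that a per-iteration token count, which the observer reads from packet sizes, is enough to pick one prompt out of a known set even when sampling noise perturbs it.

```python
import hashlib
import random


def accepted_trace(prompt, iterations=60, max_draft=4, noise=0.0, seed=0):
    """Toy stand-in for a speculative decoder (hypothetical, not a real model).

    Returns the number of tokens emitted per iteration: accepted draft
    tokens plus the one token the target model always contributes.
    """
    # Assumption: the accept/reject pattern is a deterministic function of
    # the prompt. A hash-seeded RNG fakes that dependence here.
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    pattern_rng = random.Random(int.from_bytes(digest[:8], "big"))
    noise_rng = random.Random(seed)

    trace = []
    for _ in range(iterations):
        emitted = 1 + pattern_rng.randint(0, max_draft)
        # Sampling noise: occasionally an iteration deviates from the pattern.
        if noise_rng.random() < noise:
            emitted = noise_rng.randint(1, max_draft + 1)
        trace.append(emitted)
    return trace


def identify(observed, references):
    """Return the reference prompt whose trace differs in the fewest positions."""
    def distance(a, b):
        return sum(x != y for x, y in zip(a, b))

    return min(references, key=lambda prompt: distance(observed, references[prompt]))


# The attacker profiles 50 candidate prompts offline, then matches one
# noisy observation of a victim session against those profiles.
prompts = [f"prompt {i}" for i in range(50)]
references = {p: accepted_trace(p) for p in prompts}
observed = accepted_trace("prompt 7", noise=0.2, seed=1)
guess = identify(observed, references)
```

In this toy, raising `noise` plays the role of raising temperature: the match degrades gradually rather than failing outright, which fits the paper's finding that accuracy stays above the random baseline at temperature 1.0.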

Whisper Leak (arXiv 2511.03675, posted November 2025) generalizes the streaming case across 28 popular LLMs from major providers, classifying prompt topics from packet size and timing with often >98% AUPRC, and reaching 100% precision on sensitive topics like “money laundering” even at a 10,000:1 noise-to-target imbalance. The authors disclosed responsibly and worked with providers on initial countermeasures.
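The 10,000:1 figure is easier to appreciate with a little arithmetic. The sketch below is our own illustration, not a calculation from the paper: `precision_at_imbalance` is a hypothetical helper, and the recall and false-positive numbers in the example are made up. It shows that when noise conversations outnumber targets 10,000 to 1, even a tiny false-positive rate swamps the true hits, so very high precision requires essentially zero false positives.

```python
def precision_at_imbalance(recall, false_positive_rate, noise_per_target):
    """Precision when each target conversation is mixed with `noise_per_target` others.

    Per target conversation, the classifier flags `recall` true positives and
    `false_positive_rate * noise_per_target` false positives on average.
    """
    true_positives = recall
    false_positives = false_positive_rate * noise_per_target
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("classifier flags nothing; precision is undefined")
    return true_positives / flagged


# Illustrative numbers only (not from the paper):
# a 0.01% false-positive rate already drags precision down to about one in three.
leaky = precision_at_imbalance(recall=0.5, false_positive_rate=1e-4, noise_per_target=10_000)

# Zero false positives gives precision 1.0 regardless of recall.
clean = precision_at_imbalance(recall=0.05, false_positive_rate=0.0, noise_per_target=10_000)
```

Read this way, a report of 100% precision at that imbalance says the classifier raised no false alarms on the evaluated noise traffic, which is what makes the result practical for bulk surveillance.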

What an eavesdropper sees           What it leaks
----------------------------------  -----------------------------------------
Inter-token arrival timing          Topic class, conversation language
Per-iteration token / packet count  Speculative accept/reject pattern → query
                                     fingerprint, datastore contents
Streaming packet size distribution  Topic classification across many models
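Everything in the table reduces to two columns a passive observer already has: when each encrypted record arrived and how big it was. A minimal sketch of turning such a trace into classifier features follows. The `Packet` type, the `trace_features` helper, and the choice of summary statistics are our assumptions for illustration; the papers use their own, richer feature sets and models.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Packet:
    timestamp: float  # arrival time in seconds
    size: int         # ciphertext length in bytes


def trace_features(packets):
    """Summarize a captured response stream using only sizes and timings."""
    if len(packets) < 2:
        raise ValueError("need at least two packets to measure a gap")

    ordered = sorted(packets, key=lambda p: p.timestamp)
    sizes = [p.size for p in ordered]
    # Inter-arrival gaps between consecutive packets.
    gaps = [b.timestamp - a.timestamp for a, b in zip(ordered, ordered[1:])]

    return {
        "packet_count": len(ordered),
        "total_bytes": sum(sizes),
        "mean_size": mean(sizes),
        "size_stdev": pstdev(sizes),
        "mean_gap": mean(gaps),
        "gap_stdev": pstdev(gaps),
    }


# A made-up three-packet capture.
example = trace_features([Packet(0.0, 100), Packet(0.5, 140), Packet(1.5, 120)])
```

No payload byte is touched here, which is why TLS does not help: the features come from information the transport exposes by design.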

Why it matters

This sits in a different threat model from most LLM attacks. The adversary is anyone who can observe the network path (an ISP, a government performing surveillance, someone on the same Wi-Fi, or a compromised upstream router), and they never need an account, a malicious prompt, or access to the model. The leak survives TLS because it lives in metadata, not plaintext. For anyone using an LLM for medical, legal, financial, or otherwise confidential matters, “what topic am I discussing” is itself sensitive, and topic inference at 98% AUPRC is a real disclosure. The datastore-extraction result is worse: it can pull the contents of a prediction datastore out of a serving system through packet-size metadata alone. This connects to the broader inference-side leakage problem we have covered in prefix-cache timing prompt stealing and inference leakage budgets: the serving layer, not just the model, is an attack surface.

Defenses

The papers propose and evaluate concrete mitigations. The honest summary from the Whisper Leak authors is that each one helps but none fully closes the channel, so layer them.

  1. Pad packet sizes. Random padding and fixed-size buffering blur the size signal that fingerprints queries. It costs bandwidth; budget for it on sensitive endpoints.

  2. Batch and aggregate tokens before flushing. Iteration-wise token aggregation and token batching break the link between each decoding iteration and the packet it produces, which is the relationship speculative decoding exposes. This trades a little perceived latency for a lot of signal reduction.

  3. Inject cover traffic. Packet injection adds decoy flushes so the observable stream no longer tracks generation. Whisper Leak evaluates it as a partial control.

  4. Treat speculative/parallel decoding as a privacy setting. For high-confidentiality workloads, consider disabling speculative decoding or running the model in an isolated, local deployment so there is no observable wire between user and model.

  5. Don’t rely on TLS alone for confidentiality. If your users may face network-level adversaries, document that prompt topics can leak and route sensitive use through padded/batched endpoints or on-prem inference.
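Items 1 and 2 above can be sketched in a few lines. This is a minimal illustration under our own assumptions, not any provider's or paper's implementation: `pad_to_bucket`, `unpad`, and `batch_tokens` are hypothetical helpers, the 4-byte length prefix is an arbitrary framing choice, and the bucket and batch sizes are placeholders to tune per endpoint.

```python
def pad_to_bucket(payload: bytes, bucket: int = 256) -> bytes:
    """Length-prefix the payload and zero-pad it to a multiple of `bucket` bytes."""
    if bucket <= 4:
        raise ValueError("bucket must be larger than the 4-byte length prefix")
    framed = len(payload).to_bytes(4, "big") + payload
    padded_len = -(-len(framed) // bucket) * bucket  # ceiling to next multiple
    return framed.ljust(padded_len, b"\x00")


def unpad(frame: bytes) -> bytes:
    """Recover the original payload from a frame built by pad_to_bucket."""
    length = int.from_bytes(frame[:4], "big")
    return frame[4:4 + length]


def batch_tokens(iterations, batch_size: int = 8):
    """Regroup per-iteration token lists into fixed-size flushes.

    `iterations` yields the tokens produced by each decoding iteration
    (one or several under speculative decoding). Flush boundaries no
    longer line up with iteration boundaries.
    """
    buffer = []
    for tokens in iterations:
        buffer.extend(tokens)
        while len(buffer) >= batch_size:
            yield buffer[:batch_size]
            buffer = buffer[batch_size:]
    if buffer:
        yield buffer  # final partial batch; pad it so its size does not stand out


# Three iterations emitting 1, 3, and 2 tokens become flushes of 4 and 2.
flushes = list(batch_tokens([["a"], ["b", "c", "d"], ["e", "f"]], batch_size=4))
frames = [pad_to_bucket(" ".join(f).encode("utf-8")) for f in flushes]
```

Both helpers blur the size signal but leave timing partly intact: how quickly a batch fills still depends on how many tokens each iteration accepts. That is one reason to layer these with cover traffic rather than rely on either alone.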

Status

These are publicly posted research findings (all three are on arXiv), not zero-days, and the streaming variant was responsibly disclosed, with vendor countermeasures underway. Treat the mitigations above as the current state of the art: they reduce, but do not eliminate, metadata leakage from LLM serving.

Sources