Multimodal input as attack surface: vLLM's video-decoder RCE (CVE-2026-22778)
CVE-2026-22778 turns a malicious video URL into remote code execution on vLLM servers, chaining a PIL info leak with an FFmpeg JPEG2000 heap overflow. Patched in 0.14.1.
In brief
Most LLM security writing is about prompts. CVE-2026-22778 is a reminder that an inference server’s media-decoding pipeline is an attack surface too. A malicious video sent to a vLLM endpoint that serves a video model can chain a memory-address leak with a heap overflow in a bundled native decoder to reach remote code execution, with no prompt injection involved. It affects vLLM 0.8.3–0.14.0; the fix is in 0.14.1. Default vLLM ships without authentication, which makes exposed deployments directly reachable.
What is this?
vLLM is one of the most widely deployed engines for serving large language models, with a Python package downloaded over three million times a month. When it serves a multimodal model, user input is no longer just text: images and videos are submitted to the API and decoded before they reach the model.
CVE-2026-22778 (GitHub advisory GHSA-4r2x-xpjr-7cvv), analysed publicly by OX Security in a write-up first published February 2, 2026 and tracked since across NVD and several vulnerability databases, is a remote code execution bug in that decoding path. An attacker who can submit a crafted video URL to a vLLM instance serving a video model can run arbitrary commands on the host. Deployments that do not serve a video model are not affected.
How it works
The disclosed issue is a chained exploit: two separate weaknesses, each of limited use on its own, that are devastating together. We describe the mechanism at a conceptual level; no working payload is reproduced here.
The first link is an information leak. When an invalid image is sent to the multimodal endpoint, the Python Imaging Library (PIL) raises an error whose message is returned to the client. That message includes a heap memory address. Leaking a single valid address collapses the search space that Address Space Layout Randomisation (ASLR) is meant to protect — defenders have described the effect as reducing roughly four billion possible guesses to a handful.
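To make the leak pattern concrete, here is a minimal sketch using only the Python standard library. It assumes the address reaches the client through a default object repr embedded in exception text; the source does not specify the exact message, so the wording below is illustrative. `decode_image`, `leaky_handler` and `safe_handler` are hypothetical names, not vLLM or PIL APIs.

```python
import io
import re

# Matches hex tokens such as 0x7f3a2c1b9e40, as found in default object reprs.
ADDRESS_RE = re.compile(r"0x[0-9a-fA-F]+")


def decode_image(buf: io.BytesIO) -> None:
    """Hypothetical stand-in for a decoder that rejects its input.

    The message embeds repr(buf); on CPython that repr contains the
    object's memory address.
    """
    raise ValueError(f"cannot identify image file {buf!r}")


def leaky_handler(data: bytes) -> str:
    """Unsafe pattern: the raw exception text goes back to the caller."""
    try:
        decode_image(io.BytesIO(data))
    except ValueError as exc:
        return str(exc)
    return "ok"


def safe_handler(data: bytes) -> str:
    """Safer pattern: a fixed message; details belong in server-side logs."""
    try:
        decode_image(io.BytesIO(data))
    except ValueError:
        return "invalid image input"
    return "ok"


leaky = leaky_handler(b"not an image")
safe = safe_handler(b"not an image")
print(ADDRESS_RE.findall(leaky))  # one address-like token on CPython
print(ADDRESS_RE.findall(safe))   # []
```

The point of the sketch is that nothing exotic is required: a single formatted exception string crossing the API boundary is enough to hand out a pointer value.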
The second link is a heap buffer overflow in the JPEG2000 decoder. vLLM uses OpenCV for video decoding, and OpenCV bundles FFmpeg 5.1.x. The JPEG2000 decoder trusts the image’s channel-definition (cdef) metadata to remap colour channels without re-validating buffer sizes, so data destined for one channel can be written into a smaller buffer for another, overflowing into adjacent heap memory. Because the attacker controls both the frame geometry and the channel mapping, they control how much memory is overwritten and which neighbouring objects are hit. Combined with the leaked address from step one, that control is enough to overwrite a function pointer and redirect execution into a libc routine such as system().
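The missing check can be illustrated with a toy model. The sketch below is not FFmpeg's data layout or the actual patch; `Plane` and `check_remap` are hypothetical names, and the assumption is simply that each channel has its own buffer whose size depends on its dimensions (as with subsampled chroma), so a remap is only safe if the source fits in the destination.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plane:
    """A per-channel sample buffer described by its dimensions (toy model)."""
    width: int
    height: int

    @property
    def size(self) -> int:
        return self.width * self.height


def check_remap(planes, remap):
    """Reject a remap that would copy a channel into a smaller buffer.

    remap maps a source channel index to a destination channel index.
    """
    for src, dst in remap.items():
        for index in (src, dst):
            if not 0 <= index < len(planes):
                raise ValueError(f"channel index {index} out of range")
        if planes[src].size > planes[dst].size:
            raise ValueError(
                f"channel {src} ({planes[src].size} samples) does not fit "
                f"in buffer {dst} ({planes[dst].size} samples)"
            )


# Full-resolution first channel, two half-resolution channels.
planes = [Plane(64, 64), Plane(32, 32), Plane(32, 32)]

check_remap(planes, {0: 0, 1: 1, 2: 2})  # identity mapping passes

try:
    check_remap(planes, {0: 1})  # 4096 samples into a 1024-sample buffer
except ValueError as exc:
    print("rejected:", exc)
```

In a memory-safe language the oversized copy would raise an error; in native decoder code the same copy silently writes past the end of the smaller buffer, which is why the size comparison has to be explicit.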
The trigger is ordinary API usage: a video URL passed to a completions or invocations call for a video model. A default vLLM instance installed from pip or Docker has no authentication, so an internet-exposed endpoint can be reached directly.
Why it matters
The lesson generalises well beyond this one CVE. Teams reason hard about prompt injection and jailbreaks, then forward untrusted images and videos straight into native C/C++ decoders — code with a long history of memory-safety bugs — running inside the same process as the model and its credentials. The “AI” part of the stack inherits every classic memory-corruption risk of the media-processing libraries underneath it.
The blast radius is large because of where inference servers sit: close to GPUs, model weights, internal networks and API keys. RCE here means full server takeover, data exfiltration and lateral movement. And the underlying overflow lived in a bundled third-party dependency, so an organisation could be exposed without ever having written or reviewed the vulnerable line. As of disclosure there was no public evidence of in-the-wild exploitation, but the affected range spans many releases.
Defenses
- Upgrade to vLLM 0.14.1 or later. The fix landed across pull requests #31987, #32319 and #32668; it sanitises the leaking error messages and updates the vulnerable decoder. Check your version with `pip show vllm`.
- Disable the video model feature in production until patched if you cannot upgrade immediately. Deployments that serve only text or images via the fixed paths are not exposed to this specific chain.
- Never expose a raw vLLM endpoint to untrusted networks. Default vLLM has no auth; put it behind an authenticated gateway, restrict ingress, and treat every uploaded media object as hostile input.
- Sandbox the decoding and inference tier. Run serving in a minimal container with restricted egress and least-privilege credentials, segmented from sensitive data stores, so a decoder compromise cannot pivot to the rest of the environment.
- Don’t return raw library errors to clients. Leaking exception text — addresses, stack frames, paths — is a recurring info-leak primitive. Catch and replace internal error messages at the API boundary.
- Track your native dependencies. Image and video decoders (OpenCV, FFmpeg, PIL) are part of your LLM attack surface. Pin and monitor them in your SBOM and patch them with the same urgency as the model framework itself.
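For fleets where running `pip show vllm` by hand is impractical, the version check from the first item can be scripted. This is a minimal sketch using only the standard library; the helper names are hypothetical, and the parser is a simplification that looks only at the leading numeric release (so a pre-release such as `0.14.1rc1` is treated like `0.14.1`). A version in the affected range is only exploitable via this chain if a video model is served.

```python
import re
from importlib import metadata

# Affected range as stated in the advisory summary above (inclusive).
FIRST_AFFECTED = (0, 8, 3)
LAST_AFFECTED = (0, 14, 0)


def release_tuple(version: str) -> tuple:
    """Extract the leading numeric release, e.g. '0.10.1.post1' -> (0, 10, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    parts = tuple(int(p) for p in match.group().split("."))
    return (parts + (0, 0, 0))[:3]  # pad short versions such as '0.8'


def is_affected(version: str) -> bool:
    """True if the release falls inside the affected range."""
    return FIRST_AFFECTED <= release_tuple(version) <= LAST_AFFECTED


def installed_vllm_version():
    """Return the installed vLLM version string, or None if not installed."""
    try:
        return metadata.version("vllm")
    except metadata.PackageNotFoundError:
        return None


version = installed_vllm_version()
if version is None:
    print("vllm is not installed in this environment")
elif is_affected(version):
    print(f"vllm {version} is in the affected range; upgrade to 0.14.1 or later")
else:
    print(f"vllm {version} is outside the affected range")
```

A check like this fits naturally into a CI job or container build step, alongside whatever SBOM tooling already tracks OpenCV, FFmpeg and PIL.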
Status
| Item | Reference | Date | Notes |
|---|---|---|---|
| CVE-2026-22778 advisory | GitHub (GHSA-4r2x-xpjr-7cvv) | 2026 | Chained info leak + heap overflow, RCE via video |
| Public technical analysis | OX Security | 2026-02-02 | PIL address leak + FFmpeg JPEG2000 overflow |
| Affected versions | vLLM 0.8.3 → 0.14.0 | — | Only deployments serving a video model |
| Patched version | vLLM 0.14.1 | — | Fix PRs #31987, #32319, #32668 |
The right framing is not “another vLLM CVE.” It is that multimodal endpoints quietly extend an LLM service’s attack surface into decades-old native media code — and that surface deserves the same isolation, input distrust and dependency hygiene as any other untrusted parser.