Bleeding Llama: a GGUF parsing flaw leaks Ollama process memory to unauthenticated attackers
CVE-2026-7482, publicly disclosed in May 2026 and codenamed Bleeding Llama by Cyera, lets a remote attacker pull arbitrary chunks of an Ollama server's heap (API keys, system prompts, other users' conversations) with three unauthenticated API calls. The silent patch shipped about two months before the CVE was assigned.
What is this?
In May 2026, researchers at Cyera publicly disclosed CVE-2026-7482 (CVSS 9.1), an unauthenticated out-of-bounds read in Ollama, the popular local-inference runtime. They named it Bleeding Llama. The flaw lets a remote attacker who can reach an Ollama server’s HTTP API pull arbitrary chunks of the server’s heap memory — environment variables, API keys, system prompts, and fragments of other users’ conversations — without ever authenticating.
The defect lives in Ollama’s GGUF parser. When the server processes a GGUF model file whose declared tensor offset and size exceed the file’s actual length, functions in fs/ggml/gguf.go and server/quantization.go read past the allocated heap buffer during quantization. The result is a controllable heap leak.
The fix shipped in Ollama v0.17.1 on February 25, 2026 without being flagged as a security release. The CVE was only assigned by Echo (a third-party CNA) on April 28, 2026 after MITRE failed to respond for nearly two months, and the public write-up landed in May 2026. For roughly nine weeks between the release and the CVE assignment, the patch existed but operators had no signal that they needed to prioritise it. Internet-exposed Ollama instances, estimated at over 300,000 servers globally, remained vulnerable across that whole window unless they had already upgraded.
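The intervals in this timeline can be checked with plain date arithmetic. The sketch below uses only the dates quoted in this article (the March 2 MITRE submission date comes from the Status table); the constant and function names are illustrative.

```python
from datetime import date

# Dates as quoted in this article.
PATCH_RELEASED = date(2026, 2, 25)   # Ollama v0.17.1
MITRE_SUBMITTED = date(2026, 3, 2)   # initial CVE request to MITRE
CVE_ASSIGNED = date(2026, 4, 28)     # CVE assigned by Echo


def days_between(start: date, end: date) -> int:
    """Whole days elapsed from start to end."""
    return (end - start).days


patch_to_cve = days_between(PATCH_RELEASED, CVE_ASSIGNED)
mitre_wait = days_between(MITRE_SUBMITTED, CVE_ASSIGNED)

print(f"patch to CVE assignment: {patch_to_cve} days")      # 62 days
print(f"MITRE submission to assignment: {mitre_wait} days")  # 57 days
```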
How it works
Bleeding Llama is a textbook trust-boundary failure: a server parses an untrusted file format and trusts the metadata fields that describe how to read the payload.
# Conceptual sketch based on the public Cyera advisory.
# No exploit payload against any live system is reproduced.
[ attacker ]
│
│ 1. POST /api/create ─── upload crafted GGUF with inflated tensor shape
▼
[ Ollama server ]
│
│ 2. parse GGUF metadata
│ └── tensor offset / size are NOT bounds-checked against file length
│
│ 3. quantization step reads N bytes from heap
│ where N comes from the attacker-controlled tensor descriptor
│ └── out-of-bounds read past the allocated buffer
│
│ 4. POST /api/push ─── server pushes the resulting "model" out
▼
[ attacker exfil endpoint ]
│
└── heap bytes: env vars, API keys, system prompts, other users' chats
Two pieces of context matter for the impact.
First, the GGUF format is the standard packaging for local model weights — every modern open-weight model is shipped this way. A GGUF file is just a binary blob with a metadata header that tells the loader where each tensor lives and how big it is. Bleeding Llama is the bug class you’d predict from that design: the parser believed the header.
Second, the two endpoints used in the chain (/api/create and /api/push) are unauthenticated by default in Ollama. The upstream documentation notes this, and the default bind address is 127.0.0.1, but many real-world deployments override it with OLLAMA_HOST=0.0.0.0 so the box can serve a developer’s network or a container fleet. That single environment-variable change is what turns Bleeding Llama from a local annoyance into a remote, internet-exposed primitive.
The leaked memory is the Ollama process heap, which routinely contains: the system prompt currently being served, recent user prompts and model outputs, environment variables (which on cloud VMs often include AWS / GCP / Anthropic / OpenAI keys), and TLS material if any has been touched. Three API calls are enough to reproduce a controllable disclosure.
Why it matters
Three things to flag for anyone running LLM infrastructure.
The first is the most obvious: the attack surface of “local” LLM runtimes is much wider than teams assume. Ollama is often deployed under the mental model of a developer tool, but in practice it is a network-reachable inference server that handles secrets and PII the moment any user talks to it. Scans by external researchers in May 2026 found that a substantial fraction of self-hosted AI infrastructure is exposed to the public internet with no authentication. Bleeding Llama is a memory-disclosure example, but the same posture is what made the earlier CVE-2026-33626 (LMDeploy SSRF) and the LiteLLM SQLi (CVE-2026-42208) readily exploitable within hours of disclosure.
The second is the silent-patch problem. The fix shipped in v0.17.1 on February 25 with a non-security release note. The CVE was issued on April 28. For 62 days, operators using vulnerability scanners or patch-management tools had no CVE to match against and no severity signal pointing them at the upgrade. This pattern is not specific to Ollama: many AI frameworks lack a security advisory pipeline, and several MITRE CNA backlogs have slipped over the past year. If your AI inventory depends on CVE feeds to trigger patching, you are systematically behind on AI infrastructure.
The third is the GGUF supply-chain implication. Model files are now the equivalent of executable artifacts — they drive complex parsing logic on the server. Treating them as inert data is wrong. Any pipeline that ingests GGUF files from external sources (Hugging Face downloads, mirrored model registries, user-uploaded fine-tunes) is exposed to whatever parsing bugs exist in the consumer. Bleeding Llama is one such bug; it almost certainly will not be the last.
Defenses
Upgrade to Ollama v0.17.1 or later. This is the only fix for the underlying parser bug. Older releases are not safely patchable in place because the bounds checks were added throughout the GGUF and quantization code paths.
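For an inventory script, the version test can be as simple as a tuple comparison. This is a minimal sketch: the helper names are hypothetical, the version string is assumed to be something you have already collected from each host, and pre-release suffixes (for example `-rc1`) are ignored, so treat release candidates of 0.17.1 with suspicion.

```python
import re

FIXED_VERSION = (0, 17, 1)  # first release containing the fix, per this article


def parse_version(text: str) -> tuple:
    """Extract (major, minor, patch) from strings like '0.16.3' or 'v0.17.1'."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def needs_upgrade(version_text: str) -> bool:
    """True if the reported version predates the fixed release."""
    return parse_version(version_text) < FIXED_VERSION


for reported in ("0.9.6", "0.17.0", "v0.17.1", "0.20.0"):
    print(reported, "needs upgrade" if needs_upgrade(reported) else "ok")
```

Comparing integer tuples avoids the classic string-comparison trap where "0.9.6" sorts after "0.17.0".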
Audit your bind address and authentication. If your Ollama runs with OLLAMA_HOST=0.0.0.0 or behind a public load balancer, treat it as an exposed service. Bind to 127.0.0.1 and reach it via SSH or a VPN, or front it with a reverse proxy that enforces authentication and rate-limits /api/create and /api/push. The runZero advisory documents query strings you can use to find your own exposed instances.
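A quick way to audit configuration at scale is to classify each host's `OLLAMA_HOST` value as loopback-only or not. The sketch below is a heuristic, not Ollama's own parsing logic: the function name is hypothetical, an unset value is treated as the 127.0.0.1 default described above, and anything it cannot positively identify as loopback is conservatively reported as exposed.

```python
import ipaddress
from urllib.parse import urlsplit


def is_loopback_only(ollama_host: str) -> bool:
    """Heuristically decide whether an OLLAMA_HOST-style value binds to loopback."""
    value = ollama_host.strip()
    if not value:
        return True  # unset: the default bind address is 127.0.0.1
    if "://" not in value:
        value = "//" + value  # let urlsplit treat 'host:port' as a network location
    host = urlsplit(value).hostname
    if host is None:
        return False  # no explicit host (e.g. ':11434'): assume exposed
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # a DNS name we cannot classify without resolving it


for candidate in ("", "127.0.0.1:11434", "0.0.0.0", "http://0.0.0.0:11434", "[::1]:11434"):
    status = "loopback only" if is_loopback_only(candidate) else "treat as exposed"
    print(repr(candidate), "->", status)
```

A loopback bind is necessary but not sufficient: a reverse proxy or port forward in front of the process can still publish it, so pair this check with an external scan.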
Network-segment your LLM runtime away from secrets. A heap leak can only expose what the process holds in memory. Do not pass production cloud credentials, third-party API keys, or PII through environment variables of a public-facing inference server. Use a sidecar with a tightly scoped IAM role, or a secrets broker that hands out short-lived tokens on demand. This is the same principle that limits the blast radius of SSRF and RCE in the same family of AI frameworks.
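One low-effort audit is to list which environment variables in the inference process look like credentials. The following sketch uses a hypothetical name-based heuristic (`KEY`, `TOKEN`, `SECRET`, `PASSWORD`, `CREDENTIAL`); it will miss secrets with innocuous names and flag some harmless ones, so treat the result as a starting point for review.

```python
import re
from typing import Mapping

# Hypothetical heuristic: variable names that commonly indicate credentials.
_SECRET_NAME = re.compile(r"KEY|TOKEN|SECRET|PASSWORD|CREDENTIAL", re.IGNORECASE)


def secret_like_names(env: Mapping[str, str]) -> list:
    """Return sorted names of non-empty variables whose names look like secrets."""
    return sorted(
        name for name, value in env.items()
        if value and _SECRET_NAME.search(name)
    )


# Synthetic environment for illustration; pass os.environ to audit a real process.
example_env = {
    "AWS_SECRET_ACCESS_KEY": "example-value",
    "OPENAI_API_KEY": "example-value",
    "OLLAMA_HOST": "0.0.0.0",
    "PATH": "/usr/bin",
    "EMPTY_TOKEN": "",
}
print(secret_like_names(example_env))
```

Only names are returned, never values, so the audit output itself is safe to log.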
Treat GGUF as untrusted input. If your pipeline pulls model files from any source you do not control end-to-end, validate the file header out-of-process — for example by parsing the metadata in a sandboxed binary and refusing files whose declared tensor extents do not match the file length. Several open-weight model registries are now starting to publish signed GGUF manifests; prefer them.
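The extent check described here reduces to a few comparisons once the tensor descriptors have been parsed. The sketch below is an illustration under stated assumptions, not the upstream fix: `TensorExtent` and `out_of_bounds_tensors` are hypothetical names, offsets are taken as relative to the start of the tensor data section, and each tensor's byte size is assumed to have been computed already from its declared shape and type.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TensorExtent:
    name: str
    offset: int  # byte offset relative to the start of the tensor data section
    size: int    # declared size in bytes, derived from shape and type


def out_of_bounds_tensors(
    tensors: Iterable[TensorExtent], data_start: int, file_size: int
) -> list:
    """Return names of tensors whose declared extent falls outside the file."""
    data_len = file_size - data_start
    bad = []
    for tensor in tensors:
        fits = (
            data_len >= 0
            and 0 <= tensor.offset <= data_len
            # Written as a subtraction so the same check stays overflow-safe
            # if ported to a language with fixed-width integers.
            and 0 <= tensor.size <= data_len - tensor.offset
        )
        if not fits:
            bad.append(tensor.name)
    return bad


# Synthetic example: a 1,024-byte file whose data section starts at byte 256.
declared = [
    TensorExtent("ok.weight", offset=0, size=512),
    TensorExtent("inflated.weight", offset=512, size=1_000_000),
]
print(out_of_bounds_tensors(declared, data_start=256, file_size=1024))
```

A pipeline would refuse the file if the returned list is non-empty, ideally running the check in a separate, sandboxed process as suggested above.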
Subscribe to advisories from your AI runtime vendor, not just CVE feeds. Bleeding Llama is the case study for why CVE feeds alone are insufficient. Subscribe directly to Ollama’s GitHub Security Advisories, LiteLLM’s, LMDeploy’s, vLLM’s, and your inference vendor’s. Watch their release notes for silent patches, and work from the assumption that any non-trivial parser change might be security-relevant.
Adopt the OWASP LLM Top 10 supply-chain mitigations. OWASP LLM03 (Supply Chain) and LLM07 (System Prompt Leakage) directly apply here. The 2026 revision now explicitly references model-file parsing as part of the supply-chain attack surface.
Status
| Item | Reference | Date | Notes |
|---|---|---|---|
| CVE | CVE-2026-7482, CVSS 9.1 | Assigned 2026-04-28 (Echo CNA) | Initial MITRE submission 2026-03-02; assigned by Echo after no MITRE response |
| Discovery | Cyera Research (“Bleeding Llama”) | Disclosed publicly May 2026 | Responsible-disclosure timeline documented in Cyera advisory |
| Patch | Ollama v0.17.1 | 2026-02-25 | Shipped without security flag in release notes |
| Public write-up | Cyera / The Hacker News | 2026-05 | Reports an estimated 300,000+ exposed servers worldwide |
| Affected component | fs/ggml/gguf.go, server/quantization.go | — | Out-of-bounds heap read during quantization |
| Affected versions | Ollama < 0.17.1 | — | Includes all earlier minor branches |
| OWASP mapping | LLM03 (Supply Chain), LLM07 (System Prompt Leakage) | 2026 revision | Model-file parsing now part of supply-chain scope |
| Related disclosures | LMDeploy CVE-2026-33626 (SSRF), LiteLLM CVE-2026-42208 (SQLi), Langflow CVE-2026-33873 (RCE) | 2026 | Same pattern: unauthenticated AI-framework endpoints |
Bleeding Llama is not an exotic novel attack — it is a classic memory-safety bug in a parser, wrapped around an AI-specific file format. What makes it worth a flag is the operational reality around it: the runtime is more network-exposed than its developers expect, the patch was silent, the CVE was late, and the leaked bytes are exactly the secrets that LLM deployments accumulate. Treat your inference servers like the production data-plane services they have become.
Sources
- https://www.cyera.com/research/bleeding-llama-critical-unauthenticated-memory-leak-in-ollama
- https://thehackernews.com/2026/05/ollama-out-of-bounds-read-vulnerability.html
- https://nvd.nist.gov/vuln/detail/CVE-2026-7482
- https://threatprotect.qualys.com/2026/05/11/ollama-heap-out-of-bounds-read-vulnerability-leads-to-remote-process-memory-leak-cve-2026-7482/
- https://www.csoonline.com/article/4168584/ollama-vulnerability-highlights-danger-of-ai-frameworks-with-unrestricted-access.html