Your LLM Cluster Is at 90% HBM and 60% Is KV Cache. Welcome to the Disaggregation Cliff.

vLLM prefix caching is great. It stops at one node. When your fleet of 50 H100s is bottlenecked on KV cache and adding GPUs is not financially viable, the next architecture is disaggregated KV cache. Here is the wall, the math, Mooncake, and what to actually do on Monday.

By Sharon Sahadevan · May 19, 2026 · 20 min read

You inherit an LLM serving cluster. 50 nodes, each with an H100. The dashboards look fine: GPU utilization averages 78%, p99 latency is within SLO, no obvious incidents in the last quarter. Then you spike traffic by 30% and the cluster tips. p99 latency doubles. Request queue depth climbs. Pods start preempting in-flight sequences.

You check the memory. Every node is sitting at 92% HBM utilization. You drill in: 60% of that HBM is KV cache. The model weights are not the problem. Activations are not the problem. The KV cache is.

The obvious answer is "add more GPUs." Each H100 node runs about $40K per month at hyperscaler prices. The CFO has already asked twice why the GPU bill is going up faster than the user count. You need a different answer.

This is the KV cache wall. Most teams hit it somewhere between 30 and 100 nodes of serious LLM serving. The single-node optimizations (PagedAttention, FP8 KV, careful batch sizing) that got you here cannot get you past it. The architectural shift is treating KV cache as a cluster-wide resource instead of a per-pod allocation. This post is what the wall looks like, why each single-node lever runs out, and the disaggregation pattern that Moonshot AI and others are using in production today.

Why KV cache eats your HBM

The KV cache holds the per-layer key and value tensors generated during prefill and reused at every decode step. Without it, every decode step would recompute keys and values for the whole sequence, and the cost of generating a sequence would grow quadratically with its length. With it, each step computes keys and values for the new token only and reads the rest from cache. That is non-negotiable.

What makes KV cache the dominant memory consumer at scale is the shape of how it grows:

Linear in sequence length. Every token adds another K and V tensor at every layer.
Linear in batch size. Every concurrent request holds its own KV cache resident for its entire lifetime.
Cannot be evicted mid-request. Drop the cache and you have to redo the prefill from scratch.

The math is brutal. For a 70B model at FP16, KV cache per token is roughly 320 KB (80 layers, 8 KV heads, 128 head dim, 2 tensors at 2 bytes each, give or take depending on architecture). At 4K context and batch size 16, that is about 20 GB of HBM just for KV cache. Push context to 32K and batch size 32 and you are at 320 GB, more HBM than fits on any single H100 even if model weights were free. With 1M-context models becoming standard for frontier serving, the math goes from painful to impossible on any one node.
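The arithmetic above is easy to reproduce. This is a minimal sketch using the layer and head counts quoted in the text; the function names are illustrative, and real models will differ depending on architecture.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache one token occupies: one K and one V tensor at every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gib(context_tokens: int, batch_size: int, per_token_bytes: int) -> float:
    """Total KV cache for a batch of equally long sequences, in GiB."""
    return context_tokens * batch_size * per_token_bytes / 2**30


# 70B-class shape from the text: 80 layers, 8 KV heads, head dim 128, FP16 (2 bytes)
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_value=2)

print(per_token // 1024, "KiB per token")               # 320
print(kv_cache_gib(4096, 16, per_token), "GiB")         # 20.0
print(kv_cache_gib(32768, 32, per_token), "GiB")        # 320.0
```

Swapping bytes_per_value to 1 shows what FP8 KV buys you: the same shapes at half the footprint, which helps but does not change how the footprint grows.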

Then there is the repetition problem. Production traffic almost never sees unique prompts. A customer support chatbot serves 50K conversations with the same 2K-token system prompt at the front of every request. A RAG product reuses the same retrieved chunks across follow-up questions. An agent loop replays the same plan and tool descriptions every iteration. Naively, every request re-prefills the shared portion. That is the single biggest source of unnecessary cost in most production LLM stacks today.

GPU Memory Fragmentation Explained covers why the HBM you "have" is rarely the HBM you can actually use. The KV cache wall is the next layer up from that one.

The single-node levers and where each one runs out

Most teams reach for the same five levers when KV cache pressure shows up. Each one helps. Each one hits a wall.

1. Bigger GPUs. H100 to H200 (141 GB HBM) to B200 (192 GB HBM). Linear cost, sublinear gains as your workload keeps growing. Buys time, not architecture.

2. Smaller batches. Cuts concurrent KV. Throughput drops proportionally. The economics of inference depend on saturating the GPU with concurrent decode; smaller batches mean fewer tokens per dollar.

3. Shorter context. Caps the worst case but limits what your application can do. Long-context retrieval workflows, agent loops with growing scratchpads, multi-turn conversations all push back hard on this.

4. KV cache quantization (FP8, INT8, INT4). Cuts KV memory 2x to 4x. vLLM, SGLang, and TensorRT-LLM all support it. Comes with a 0.5 to 3 percent quality drop depending on model and workload. Helpful, but bounded. You cannot quantize your way out of a 1M-context workload.

5. vLLM prefix caching (single-node). vLLM has supported automatic prefix caching since 0.5; in 2026 it is enabled by default. When two requests on the same vLLM instance share a prefix, the second one reuses the KV pages. Single biggest gain you can get from a config flag.

The fifth one is what most teams reach for last and stop at. It is excellent. It is also the lever that has the most misunderstood ceiling.

vLLM prefix caching is intra-node. The shared KV pages live in one vLLM process's HBM. Two requests routed to two different vLLM pods get zero cache benefit, even if they share a 2K-token system prompt verbatim. With a round-robin or random load balancer, your effective prefix hit rate across the cluster is roughly 1/N where N is the number of pods serving the workload. At 50 pods, you are hitting 2% of the theoretical prefix-caching upside.
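A small simulation makes the 1/N effect concrete. This is a toy model under stated assumptions, not a measurement of vLLM: it tracks per-conversation prefixes, treats a follow-up turn as a hit only when it lands on the pod that served the previous turn, and uses `conv % num_pods` as a stand-in for hashing the prefix.

```python
import random


def follow_up_hit_rate(num_pods: int, conversations: int, turns: int,
                       sticky: bool, seed: int = 0) -> float:
    """Fraction of follow-up turns routed to the pod holding the conversation's latest KV."""
    rng = random.Random(seed)
    hits = follow_ups = 0
    for conv in range(conversations):
        last_pod = None  # pod that holds this conversation's most recent KV pages
        for _ in range(turns):
            # Sticky: same conversation always maps to the same pod.
            # Otherwise: the load balancer picks a pod at random.
            pod = conv % num_pods if sticky else rng.randrange(num_pods)
            if last_pod is not None:
                follow_ups += 1
                hits += pod == last_pod
            last_pod = pod
    return hits / follow_ups


random_rate = follow_up_hit_rate(num_pods=50, conversations=20_000, turns=5, sticky=False)
sticky_rate = follow_up_hit_rate(num_pods=50, conversations=20_000, turns=5, sticky=True)
print(f"random routing: {random_rate:.3f}   sticky routing: {sticky_rate:.3f}")
```

With 50 pods the random-routing figure lands close to 0.02, while sticky routing hits on every follow-up in this idealized model. Real hit rates sit somewhere in between once cache capacity and eviction enter the picture.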

That is the wall. Every single-node optimization makes better use of one node's HBM. None of them can pool HBM across nodes, and none of them can pool prefix cache across nodes. The real shift happens when you stop treating each node as isolated.

KEY CONCEPT

vLLM prefix caching is per-vLLM-instance, not per-cluster. With a round-robin load balancer across 50 pods, you capture roughly 2% of the theoretical prefix reuse upside. The fix is not "add more pods" or "more HBM per pod"; it is "stop letting each pod own its cache." That is the disaggregation move.

The shape of the next architecture

The shift borrows the playbook CPU architects wrote 40 years ago: a memory hierarchy. Not all bytes are the same; the right answer is tiered.

+---------------------------------------------------------------+
| Tier 1: GPU HBM (per node)                                    |
|   Capacity: 80 to 192 GB per GPU                              |
|   Bandwidth: ~3 TB/s                                          |
|   Latency: ~hundreds of nanoseconds                           |
|   Role: active KV cache, decoding right now                   |
+---------------------------------------------------------------+
                          |  PCIe / NVLink (50 to 900 GB/s)
                          v
+---------------------------------------------------------------+
| Tier 2: CPU DRAM (per node)                                   |
|   Capacity: 1 to 2 TB per node                                |
|   Bandwidth: ~400 GB/s                                        |
|   Role: warm KV cache, recently-used prefixes                 |
+---------------------------------------------------------------+
                          |  RDMA / InfiniBand (200 to 400 Gb/s)
                          v
+---------------------------------------------------------------+
| Tier 3: Cluster NVMe over RDMA                                |
|   Capacity: 10 to 100+ TB per node, PB cluster-wide           |
|   Bandwidth: many GB/s aggregate (parallel reads)             |
|   Role: cold KV cache, durable cluster-wide prefix store      |
+---------------------------------------------------------------+

Three properties make this practical rather than theoretical in 2026:

PCIe is fast enough for the warm tier. Moving a 320 MB KV chunk from DRAM to HBM at 50 GB/s takes about 6 ms, comparable to a single decode step. The transfer math works.
RDMA flips the network from bottleneck to enabler. With InfiniBand or RoCE at 400 Gb/s, a remote node's DRAM or NVMe is reachable in tens of microseconds. The cluster behaves more like one giant memory pool than 50 isolated boxes.
Most workloads tolerate the latency. Pulling a prefix from Tier 3 adds tens of milliseconds. For a request that would otherwise re-prefill 2K tokens for 200 ms, the trade is obvious.
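The transfer math in these three points is worth keeping as a helper, since it comes up in every routing and tiering decision. A minimal sketch, assuming the link sustains its quoted rate and ignoring protocol overhead and queueing; the function names are illustrative.

```python
def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert a link speed from gigabits to gigabytes per second."""
    return gbit_per_s / 8


def transfer_ms(size_bytes: float, gbyte_per_s: float) -> float:
    """Milliseconds to move size_bytes over a link sustaining gbyte_per_s."""
    return size_bytes / (gbyte_per_s * 1e9) * 1e3


warm_chunk = 320e6  # the 320 MB KV chunk from the text

print(f"DRAM to HBM over PCIe at 50 GB/s:   {transfer_ms(warm_chunk, 50):.1f} ms")
print(f"Remote DRAM over 400 Gb/s RDMA:     {transfer_ms(warm_chunk, gbit_to_gbyte(400)):.1f} ms")
print(f"GPU to GPU over NVLink at 900 GB/s: {transfer_ms(warm_chunk, 900):.2f} ms")
```

The same function gives about 200 ms for a 10 GB chunk at 50 GB/s, which is why large chunks need locality-aware placement rather than free movement between nodes.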

The interesting design question is no longer "how much HBM do I need per node?" It is "what is the right tier for each KV chunk at each moment?"

Mooncake, the open-source reference

Mooncake is the open-source KV cache architecture published by Moonshot AI, the team behind the Kimi chatbot. It is the cleanest public example of how a serious production LLM service handles distributed KV cache, and it is being integrated as a backend for vLLM, SGLang, and other engines. Even if you never run Mooncake itself, understanding it teaches the shape of every serious distributed KV system shipping in 2026.

Three architectural ideas, each load-bearing on its own:

1. Disaggregated prefill and decode. Most stacks run prefill and decode on the same nodes, in the same forward-pass loop. That is simple but inefficient. Prefill is compute-bound and likes large parallel forward passes. Decode is memory-bandwidth-bound and likes many small steps with high concurrency. They want different hardware shapes. Mooncake splits them across separate node pools. Prefill nodes process the entire prompt and emit a KV cache. Decode nodes pick up the KV cache and run the generation loop. KV cache crosses the boundary over RDMA.

2. The Mooncake Store. A distributed key-value store designed specifically for KV cache. It exposes a cluster-wide pool of DRAM and NVMe, addressed by prefix hash. Any node can write a KV chunk into the Store; any other node can read it back. The Store handles replication, eviction, and locality so the inference engines do not have to.

3. KV cache as a first-class resource. The biggest shift in framing. In a traditional stack, KV cache is internal state inside the inference engine, allocated and freed per request. In Mooncake, KV cache is a durable, addressable, cluster-wide resource with its own lifecycle, independent of any single request or any single node.

A simplified request path:

# Disaggregated KV-cache request path (simplified sketch).
# mooncake_store, router, and hash_prefix are placeholders for your own
# components, not a real Mooncake API.
def handle_request(req, mooncake_store, router, hash_prefix):
    # 1. Look up reusable prefix across the cluster
    prefix_hash = hash_prefix(req.prompt)
    kv = mooncake_store.get(prefix_hash)

    if kv is not None:
        # 2a. Prefix hit: skip prefill for the cached tokens; only the
        #     uncached tail of the prompt still has to be processed
        new_tokens = req.prompt[kv.length:]
    else:
        # 2b. Prefix miss: route to a prefill node, then publish the KV
        prefill_node = router.pick_prefill_node(load_aware=True)
        kv = prefill_node.run_prefill(req.prompt)
        mooncake_store.put(prefix_hash, kv)
        new_tokens = []

    # 3. Pick a decode node, locality-aware
    decode_node = router.pick_decode_node(kv_location=kv.physical_node)

    # 4. Stream KV to decode HBM via RDMA, run generation
    decode_node.attach_kv(kv)
    yield from decode_node.generate(new_tokens, req.params)

The decision unit shifts from "which node should I send this request to" to "which KV chunk should I attach to which node." Routing is now a serious piece of infrastructure with its own SLOs.
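One detail the request path glosses over is how a prompt becomes a lookup key. A common approach in paged prefix caches is to hash fixed-size blocks of tokens, chaining each block's hash to the one before it, so that two prompts share keys exactly as far as they share tokens. The sketch below illustrates that idea only; the block size, hash function, and function names are assumptions, not Mooncake's actual key scheme.

```python
import hashlib


def block_hashes(token_ids: list[int], block_size: int = 16) -> list[str]:
    """One chained hash per full block; each hash covers its block and everything before it."""
    hashes, parent = [], ""
    full = len(token_ids) - len(token_ids) % block_size  # ignore the trailing partial block
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def longest_cached_prefix(token_ids: list[int], stored: set[str], block_size: int = 16) -> int:
    """Number of leading tokens whose blocks are all present in the store."""
    matched = 0
    for i, h in enumerate(block_hashes(token_ids, block_size)):
        if h not in stored:
            break
        matched = (i + 1) * block_size
    return matched


system_prompt = list(range(40))                       # pretend token ids
stored = set(block_hashes(system_prompt))             # two full blocks get cached
request = system_prompt[:32] + [999] * 20             # same first 32 tokens, new tail
print(longest_cached_prefix(request, stored))         # 32
```

Because the hashes are chained, changing a single early token invalidates every later block, which is the behavior you want: KV for a token depends on everything that came before it.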

Prefix caching is where the money actually is

Disaggregated KV cache is interesting. Prefix caching at scale is the part that pays for it.

Production traffic is dominated by repeated prefixes. The common patterns:

System prompts. A 2K-token system prompt repeated across every chatbot conversation. At 10K conversations per hour, that is 20M tokens of redundant prefill every hour.
RAG context. Same top-K chunks retrieved for any user asking about the same topic. Multi-turn conversations on the same topic repeat the same retrieved chunks at the start of every follow-up.
Agent loops. Iteration N has prepended the entire plan, tool calls, and observations of iterations 1 through N-1. Iterations N+1 and N+2 reuse the same growing prefix.
Few-shot examples. Workflows pinning 5 to 20 example dialogues at the front of every request.

Naive setup: each request triggers a full prefill. Disaggregated setup: the prefix is computed once and reused across every request that shares it, anywhere in the cluster.

Worked example. A customer support chatbot serves 10K conversations per hour. Average system prompt: 2K tokens. Average user-specific content per request: 300 tokens. Average response: 200 tokens.

Without cluster-wide prefix caching:

Per-request prefill: 2300 input tokens.
Per-hour total: 10K × 2300 = 23M input tokens of prefill compute.

With cluster-wide prefix caching:

System prompt prefilled once per cluster (effectively free after that).
Per-request prefill: 300 input tokens.
Per-hour total: 2K (one-time) + 10K × 300 = 3M input tokens of prefill compute.

That is roughly an 8x reduction in prefill compute. Prefill is compute-bound and a meaningful chunk of total cost, so this maps directly to GPU bills. For a workload spending $200K per month on prefill, that is about $175K per month recovered. Customers of per-token APIs see this in the bill as a "cached input tokens" line, typically 50 to 90 percent off the standard input price.
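The worked example fits in a few lines if you want to plug in your own traffic numbers. This sketch assumes the shared prefix is prefilled exactly once and never evicted, and that a cache hit costs nothing, so it gives an upper bound rather than a forecast.

```python
def prefill_tokens_per_hour(requests: int, shared_prefix: int,
                            unique_tokens: int, cluster_cache: bool) -> int:
    """Input tokens that must actually be prefilled in one hour."""
    if cluster_cache:
        # Shared prefix is computed once; every request only prefills its unique part.
        return shared_prefix + requests * unique_tokens
    return requests * (shared_prefix + unique_tokens)


naive = prefill_tokens_per_hour(10_000, 2_000, 300, cluster_cache=False)
cached = prefill_tokens_per_hour(10_000, 2_000, 300, cluster_cache=True)

print(naive)    # 23000000
print(cached)   # 3002000
print(f"reduction: {naive / cached:.2f}x")
print(f"monthly savings on a $200K prefill bill: ${200_000 * (1 - cached / naive):,.0f}")
```

The unrounded ratio comes out a little under 8x, so the savings figure lands slightly below the rounded $175K.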

KEY CONCEPT

For workloads where 80%+ of input tokens are shared prefixes (chatbots, RAG, agents), cluster-wide prefix caching is a 5x to 10x cost reduction compared to a naive setup. If your serving stack does not cache prefixes across nodes, you are paying for compute you did not need to do. That is the single biggest cost lever in production LLM serving in 2026.

What this looks like in vLLM today

You do not have to deploy Mooncake to get the cluster-wide benefit. The vLLM ecosystem has converged on a few practical retrofit options, in order of operational complexity.

Option 1: Sticky routing on top of vLLM prefix caching.

The cheapest move. Keep vLLM's automatic prefix caching enabled. Replace your round-robin load balancer with a router that hashes the prefix (or just the first N tokens) and routes consistently. Pods that already have the prefix in their KV cache see the hit rate climb dramatically.

This captures 60 to 80 percent of the cluster-wide upside at 10 percent of the operational cost. It is the right starting point for most teams.

Implementation: a small custom Envoy filter, or a Layer 7 router (Higress, BunkerWeb, custom) that reads the request body, hashes the system prompt + first N user tokens, and applies consistent hashing across the pod pool. Falls back to least-loaded routing when a hash bucket is hot.
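The routing decision itself is small. The sketch below uses rendezvous (highest-random-weight) hashing, a close relative of consistent hashing that needs no ring data structure, plus the least-loaded fallback for hot buckets. The pod names, the load map, the threshold, and the choice to key on the system prompt plus the first 32 user tokens are all assumptions for illustration, not settings from any particular router.

```python
import hashlib


def prefix_key(system_prompt: str, user_tokens: list[str], first_n: int = 32) -> str:
    """Stable key built from the system prompt plus the first N user tokens."""
    head = " ".join(user_tokens[:first_n])
    return hashlib.sha256(f"{system_prompt}\x00{head}".encode()).hexdigest()


def _weight(pod: str, prefix: str) -> int:
    """Deterministic pseudo-random weight for a (pod, prefix) pair."""
    digest = hashlib.sha256(f"{pod}|{prefix}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def pick_pod(prefix: str, load: dict[str, int], hot_threshold: int) -> str:
    """Route to the prefix's preferred pod unless it is overloaded."""
    preferred = max(load, key=lambda pod: _weight(pod, prefix))
    if load[preferred] < hot_threshold:
        return preferred
    return min(load, key=load.get)  # hot bucket: fall back to the least-loaded pod


load = {f"pod-{i}": 0 for i in range(50)}   # in-flight requests per pod
key = prefix_key("You are a support agent.", ["where", "is", "my", "order"])
home = pick_pod(key, load, hot_threshold=8)
print(home == pick_pod(key, load, hot_threshold=8))  # True: same prefix, same pod
```

A useful property of rendezvous hashing is that removing a pod only remaps the prefixes that lived on that pod, so a rolling restart does not cold-start every cache in the fleet.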

Option 2: vLLM with a CPU offload prefix cache.

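vLLM 0.7+ supports offloading KV pages to host DRAM via the swap_space config (more recently via the explicit prefix-cache offload knob). The swap_space setting reserves CPU swap space per GPU and was introduced for preempted sequences, so confirm in the docs for your vLLM version that offloaded pages are actually reused for prefix hits. Per-pod, this lets a single vLLM instance hold a much larger prefix cache than fits in HBM alone. Still per-pod (no cross-pod sharing), but the larger working set means a single sticky-routed pod can serve a much wider tail of prefixes.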

Implementation: enable swap_space (4 to 16 GB per pod is typical) and tune gpu-memory-utilization to leave headroom. The full procedure for picking the right utilization is in Tuning vLLM gpu_memory_utilization Without Breaking Production.

Option 3: External KV cache backend (Mooncake, LMCache, custom).

The full disaggregation play. vLLM exposes a KV connector interface (the LMCacheConnector is one production-grade implementation; Mooncake plugs in similarly). Behind the connector is a cluster-wide KV store (DRAM and NVMe pooled across nodes, accessed over RDMA). Every vLLM pod reads and writes through the connector. Prefix hits are cluster-wide. Cache lives independently of any single pod's lifecycle.

This is the option that delivers the full 5x to 10x cost reduction. It is also the one with the operational surface area: a distributed cache to run, an RDMA fabric to depend on, observability to add, failure modes to handle.

Most teams should walk this in order: Option 1 first (cheap, fast, high ROI), Option 2 if you have prefix patterns that overflow a single pod's HBM, Option 3 once you are at a scale where the operational cost is justified by the GPU savings.

The operational reality

Six concerns separate teams that ship distributed KV cache successfully from teams that ship it once and roll it back.

Cache invalidation, worse than usual. Standard LRU breaks. A 2K-token system prompt used by 100K requests per hour is vastly more valuable than a unique 30K-token prompt used once, but LRU ranks entries by recency alone. A burst of one-off long prompts can push out shared prefixes that are about to be used again, and LRU never asks how large an entry is or what it costs to rebuild. You need workload-aware eviction: weighted by hit rate, by size, by recompute cost.
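One way to express workload-aware eviction is a retention score per entry. The sketch below scores each entry by recompute time saved per byte held, scaled by how often it has been hit. The score formula, field names, and numbers are illustrative assumptions; a production policy would also decay hit counts over time.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    key: str
    size_bytes: int
    recompute_ms: float  # estimated cost to rebuild this KV by prefilling again
    hits: int = 0


def retention_score(e: Entry) -> float:
    """Recompute time saved per byte held, weighted by observed reuse. Higher means keep."""
    return (e.hits + 1) * e.recompute_ms / e.size_bytes


def evict_until_fits(entries: list[Entry], capacity_bytes: int) -> list[Entry]:
    """Drop lowest-scoring entries (in place) until the rest fit. Returns the evicted ones."""
    evicted = []
    entries.sort(key=retention_score)
    while entries and sum(e.size_bytes for e in entries) > capacity_bytes:
        evicted.append(entries.pop(0))
    return evicted


KV_BYTES_PER_TOKEN = 327_680  # 70B-class model at FP16
system_prompt = Entry("system-prompt", 2_000 * KV_BYTES_PER_TOKEN, recompute_ms=200.0, hits=100_000)
one_off = Entry("one-off-30k", 30_000 * KV_BYTES_PER_TOKEN, recompute_ms=3_000.0, hits=0)

cache = [one_off, system_prompt]
evicted = evict_until_fits(cache, capacity_bytes=5 * 10**9)
print([e.key for e in evicted])   # ['one-off-30k']
```

The large one-off prompt goes first even though it was written most recently, which is the opposite of what recency-only eviction would do.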

Model version consistency. A model checkpoint update invalidates every cached KV entry tied to that model. Hot-swapping models across the cluster requires either a global flush (painful, big spike) or a generational tag on every cache entry (correct, more complex). Get this wrong and you serve KV cache from one model into another model's decode loop. The outputs are complete garbage.
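The generational tag can be as simple as folding the model identity into the cache key, so that a reader on one checkpoint can never address an entry written by another. This is a sketch with a dict standing in for the distributed store; the class, method names, and the integer generation counter are hypothetical.

```python
import hashlib
from typing import Any, Optional


def cache_key(model_id: str, generation: int, token_ids: list[int]) -> str:
    """Key that changes whenever the model, its checkpoint generation, or the tokens change."""
    h = hashlib.sha256()
    h.update(f"{model_id}@{generation}|".encode())
    h.update(",".join(map(str, token_ids)).encode())
    return h.hexdigest()


class GenerationalKVStore:
    """Toy in-memory store; a real one would be the cluster-wide cache."""

    def __init__(self) -> None:
        self._data: dict[str, Any] = {}

    def put(self, model_id: str, generation: int, token_ids: list[int], kv: Any) -> None:
        self._data[cache_key(model_id, generation, token_ids)] = kv

    def get(self, model_id: str, generation: int, token_ids: list[int]) -> Optional[Any]:
        # A reader on a different generation computes a different key and simply misses.
        return self._data.get(cache_key(model_id, generation, token_ids))


store = GenerationalKVStore()
store.put("support-70b", 1, [1, 2, 3], "kv-from-checkpoint-1")

print(store.get("support-70b", 1, [1, 2, 3]))   # kv-from-checkpoint-1
print(store.get("support-70b", 2, [1, 2, 3]))   # None
```

Old generations then age out through normal eviction instead of a global flush, and two checkpoints can coexist in the cache during a rolling upgrade.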

Network bandwidth is the new HBM. A 400 Gb/s RDMA link sustains roughly 50 GB/s of useful payload. A 30K-token KV chunk for a 70B model is around 10 GB; that takes 200 ms to move. If you naively shuffle KV across the cluster on every routing decision, you spend more time on transfers than on compute. Topology-aware routing (prefer the decode node closest to the KV chunk) becomes mandatory, not optional.

Locality-aware scheduling. Routing matters more than in any stateless web app you have ever run. Sending a request to the node that already has the prefix cached can save 10x to 100x the work. Sending it to a random node forces a transfer or a recomputation. The router is now critical infrastructure with its own SLOs.

Failure modes multiply. A Tier 3 node dies mid-request and the request needs its KV cache. Network partition cuts decode from prefill. Cache corruption silently serves wrong tokens. Each scenario needs a recovery story (typically: fall back to recompute, mark the chunk invalid, page operator). None of these existed when KV cache was a per-process allocation.

Observability you did not need last year. New first-class metrics: cache hit rate by tier, prefix reuse distribution, eviction pressure, transfer queue depth, RDMA throughput utilization, time from cache lookup to KV in HBM. Old metrics still matter (TTFT, ITL) but they are now downstream of cache behavior, not independent. A drop in cache hit rate predicts a TTFT spike before the TTFT spike happens.

WAR STORY

A team I worked with rolled out a disaggregated KV cache layer to cut serving costs by an expected 4x. The cost graph showed the savings within a week. The latency graph showed a problem: p99 TTFT had quietly gotten worse. The cause turned out to be a router that was too aggressive about reusing remote KV chunks. For requests where the prefix lived on a far node, the RDMA transfer took longer than a fresh prefill would have on the local decode node. The router was optimizing for cache hit rate, not user-visible latency. The fix was a routing rule that compared estimated transfer time against estimated recompute time and picked the faster one, per request. Hit rate dropped slightly. p99 TTFT recovered. Lesson: cache hit rate is a means, not an end. The metric that ships is the one users feel.
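The routing rule from that fix is a comparison of two estimates. A minimal sketch, assuming you can estimate link throughput, transfer queue delay, and prefill throughput per request; the parameter names and the example numbers are illustrative, not taken from that team's system.

```python
def choose_kv_source(kv_bytes: int, link_gbyte_per_s: float, queue_ms: float,
                     uncached_tokens: int, prefill_tokens_per_s: float) -> str:
    """Pick whichever path gets KV into decode HBM sooner: remote transfer or local recompute."""
    transfer_ms = queue_ms + kv_bytes / (link_gbyte_per_s * 1e9) * 1e3
    recompute_ms = uncached_tokens / prefill_tokens_per_s * 1e3
    return "transfer" if transfer_ms < recompute_ms else "recompute"


KV_BYTES_PER_TOKEN = 327_680        # 70B-class model at FP16
prefix_bytes = 2_000 * KV_BYTES_PER_TOKEN

# Healthy fabric: moving the prefix takes far less time than re-prefilling 2K tokens.
print(choose_kv_source(prefix_bytes, link_gbyte_per_s=50, queue_ms=0,
                       uncached_tokens=2_000, prefill_tokens_per_s=10_000))   # transfer

# Congested path to a far node: recomputing locally wins.
print(choose_kv_source(prefix_bytes, link_gbyte_per_s=1, queue_ms=50,
                       uncached_tokens=2_000, prefill_tokens_per_s=10_000))   # recompute
```

The decision is per request because both estimates move with load: a link that is idle at 3 a.m. can be the slow path at peak.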

When to do this, and when not to

This architecture is powerful and operationally expensive. Like any distributed systems shift, it is the right answer in specific regimes.

You should reach for disaggregated KV cache if:

You serve LLMs at meaningful scale (100K+ requests per day, typically more).
Your workload has strong prefix repetition (system prompts, RAG, agents, few-shot).
You run long-context workloads regularly (32K+ tokens are common, not exceptional).
GPU cost is a meaningful line item your CFO actually looks at.
You have a platform team that can operate distributed cache infrastructure.

You should not reach for this if:

You serve fewer than a few thousand requests per day. Single-node optimizations are still in front of you.
Your prompts are highly varied with little repetition. Prefix caching has nothing to cache.
You use a managed inference API (OpenAI, Anthropic, Bedrock). Let the provider handle it; consume the cached-prompt discount.
You do not have the engineering capacity to operate a distributed cache layer.

The reasonable middle ground for most teams is Option 1 above: single-node vLLM with prefix caching enabled, plus a sticky-hash router across pods. That captures most of the benefit at a small fraction of the operational cost. Reach for full disaggregation when you can show, with metrics, that intra-node caching is no longer enough.

Common mistakes

Throwing more GPUs at it. Adding HBM solves nothing structural and burns budget. The cost curve never bends.
Treating prefix caching as a flag someone toggled once. It is often the single biggest lever. Make it a first-class feature, with metrics and SLOs.
Round-robin routing. Erases most of the cache benefit. The router becomes the most important piece of infrastructure once cache is shared.
Naive LRU eviction. Some prefixes are 1000x more valuable than others. Eviction policy has to know that.
Mixing model versions across cached KV. Silent correctness bug. Tag every cache entry with model version and refuse mismatched reads.
Aggregating cache metrics. Average hit rate hides a workload that has 99 percent hit rate on chatbot traffic and 5 percent hit rate on long-tail RAG queries. Slice by workload.
Optimizing hit rate instead of user latency. A cache hit that requires a 400 ms remote transfer is worse than a local recompute. The metric that ships is end-to-end latency.
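The aggregation mistake is easy to demonstrate with synthetic data. The sketch below builds a lookup log matching the 99 percent and 5 percent figures above and compares the overall hit rate with the per-workload view; the event format and workload labels are assumptions for illustration.

```python
from collections import defaultdict


def hit_rate_by_workload(events):
    """events: iterable of (workload, was_hit) pairs. Returns {workload: hit_rate}."""
    counts = defaultdict(lambda: [0, 0])  # workload -> [hits, lookups]
    for workload, was_hit in events:
        counts[workload][0] += int(was_hit)
        counts[workload][1] += 1
    return {w: hits / lookups for w, (hits, lookups) in counts.items()}


# 900 chatbot lookups at 99% hit rate, 100 long-tail RAG lookups at 5%
events = [("chatbot", i % 100 != 0) for i in range(900)] + \
         [("rag", i % 20 == 0) for i in range(100)]

overall = sum(was_hit for _, was_hit in events) / len(events)
print(f"overall: {overall:.3f}")          # overall: 0.896
print(hit_rate_by_workload(events))
```

An overall rate near 90 percent looks healthy on a dashboard, while the RAG slice is missing the cache on almost every lookup.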

The mental model

LLM inference is following the same maturation curve every other infrastructure problem has followed. Databases went from single-node to distributed when single-node could not hold the working set. Storage went from local disks to object stores with caching tiers when single-disk economics broke. Web serving went from one application server to stateless fleets, then to stateful service meshes when latency and locality started mattering. Each evolution had the same shape: take a thing that lived inside one box, make it a first-class distributed resource with its own lifecycle, reorganize the stack around it.

LLM inference is at exactly that inflection point. KV cache, which lived inside the inference engine, is becoming a first-class distributed resource. Prefill and decode, which lived in the same process, are becoming separate node pools with different hardware shapes. Routing, which used to be round-robin, is becoming locality-aware scheduling driven by where bytes already are.

If you are at the KV cache wall today, the answer is almost certainly not more HBM. It is the architectural shift. Start with sticky routing. Move to a CPU offload prefix cache if the workload demands it. Reach for full Mooncake-style disaggregation when the GPU savings justify the operational cost. The wall is real, and the shape of what comes next is now well-mapped.

The full architectural reasoning, including the math, the memory hierarchy, and the interview-grade trade-off analysis, is the KV Cache Architectures lesson in the LLM Operations course. The single-node optimizations that come before this (PagedAttention, continuous batching, vLLM tuning) are covered across Production GPU Infrastructure and LLM Inference on Kubernetes. Related reading: GPU Memory Fragmentation Explained for why your "free" HBM is not free, Tuning vLLM gpu_memory_utilization Without Breaking Production for the knob you tune before you reach for disaggregation, vLLM vs SGLang for Production in 2026 for the engine-choice decision one layer above this, MIG vs Time-Slicing on Kubernetes for the GPU-partitioning decision one layer below, Prompt Economics for the token cost model that makes cluster-wide prefix caching worth the operational cost, Your HPA Scales LLM Pods on CPU. They're Either Idle or On Fire. for using KV-cache pressure as a leading autoscaling signal before preemption hits, and Your GPU Finishes a Request and Waits for the Slowest for the continuous-batching scheduler that the paged KV cache was co-designed to make possible.
