# Tuning vLLM gpu_memory_utilization Without Breaking Production
The default 0.9 is wrong for almost every production deployment. Here's how to pick the right number for your model, GPU, and traffic shape.
If you've deployed vLLM to Kubernetes more than once, you've probably hit one of two failure modes:
- The pod takes 90 seconds to start, then immediately CUDA-OOMs and CrashLoopBackOffs.
- The pod runs fine for hours, then dies under traffic when an unusually long prompt arrives.
Both have the same root cause: gpu_memory_utilization is set wrong. This post is about how to pick the right value, why the default of 0.9 is too aggressive for most production deployments, and the tuning loop that lets you converge on a number you trust.
## What gpu_memory_utilization actually controls
vLLM's gpu_memory_utilization is a fraction (0.0-1.0) of total GPU memory that vLLM is allowed to use for everything: model weights, activations, the CUDA graph, and, critically, the KV cache.
When the engine starts, it does roughly this:
- Read total GPU memory from CUDA (e.g., 80 GiB on an H100 80 GB).
- Multiply by `gpu_memory_utilization` (0.9 → 72 GiB budget).
- Load the model weights (e.g., 14 GiB for a 7B model in bf16; 140 GiB for a 70B model won't fit on one card, but that's a different post).
- Allocate scratch space for activations and the CUDA graph (~2-4 GiB depending on max batch size).
- Use everything that's left as the KV cache pool.
So for a 7B bf16 model on an H100 80 GB with `gpu_memory_utilization=0.9`:
- Total budget: 72 GiB
- Model weights: 14 GiB
- Activations/graph: 3 GiB
- KV cache pool: 55 GiB
That 55 GiB of KV cache is what determines how many concurrent sequences you can serve and how long they can be. It is also the biggest variable in your throughput.
The number you actually care about is the KV cache pool size, not gpu_memory_utilization directly. gpu_memory_utilization is just a knob that, combined with model size, controls the residual KV cache budget. Model bigger? KV cache shrinks. Utilization higher? KV cache grows, but so does the risk of CUDA OOM.
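To make that arithmetic concrete, here's a back-of-envelope sketch of the budget calculation. The totals come from the worked example above; the model dimensions (`num_layers`, `num_kv_heads`, `head_dim`) are my assumptions in a Llama-2-7B-like shape, not values pulled from any config:

```python
# Back-of-envelope KV cache budget for the worked example above
# (H100 80 GB, 7B bf16 model). Dimensions are assumptions; plug in your own.

GIB = 1024**3

total_gpu_mem = 80 * GIB   # H100 80 GB
gpu_mem_util  = 0.90
weights       = 14 * GIB   # ~7B params x 2 bytes (bf16)
scratch       = 3 * GIB    # activations + CUDA graphs, rough

kv_pool = total_gpu_mem * gpu_mem_util - weights - scratch
print(f"KV cache pool: {kv_pool / GIB:.1f} GiB")  # ~55 GiB

# How many tokens that pool holds: 2 (K and V) x layers x kv_heads
# x head_dim x dtype bytes, per token.
num_layers, num_kv_heads, head_dim, dtype_bytes = 32, 32, 128, 2
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
print(f"~{kv_pool / bytes_per_token:,.0f} tokens of KV cache")  # ~112,640
```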
## Why 0.9 is too aggressive in production
The default of 0.9 assumes:
- You are the only thing using the GPU.
- The GPU has no other CUDA contexts.
- The driver overhead is small and constant.
- No other processes will spawn during your pod's lifetime.
In a Kubernetes production environment, none of these are reliably true. The unbudgeted memory consumers I've seen take down vLLM pods running at 0.9:
- Driver and CUDA context overhead. Roughly 300-800 MiB per GPU. Variable across driver versions. Goes up after kernel upgrades.
- NCCL buffers (if using tensor parallelism). 200 MiB to several GiB depending on world size and message sizes.
- DCGM exporter running on the same GPU via `nvidia-smi` polling. Negligible memory, but it counts as another CUDA context, and every extra context carries its own overhead.
- Other vLLM workers during a rolling update. With `maxSurge: 1`, you briefly have two pods on the same GPU until the old one drains, both trying to allocate at 0.9. The new one will OOM.
- Page-locked memory pinned by other pods on the node. Not GPU memory directly, but if the node is under host RAM pressure, some CUDA allocations can fail unpredictably.
The combination of these means the "free" memory CUDA reports at process start is sometimes 1-3 GiB less than total. Set gpu_memory_utilization=0.9, allocate the full 72 GiB budget eagerly (vLLM does), and you OOM on the activations or first KV block.
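One cheap defense is a preflight check in the pod's entrypoint: ask NVML what's actually free before launching the engine, instead of trusting the card's total. A minimal sketch using `pynvml`; the 95% threshold is an assumption you'd tune per node pool:

```python
# Preflight check: refuse to start when other CUDA contexts have already
# eaten into "total" GPU memory. A sketch, not vLLM's own startup logic.
import sys
import pynvml

MIN_FREE_FRACTION = 0.95  # assumption: expect >=95% of the card free

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)  # bytes
    free_frac = mem.free / mem.total
    print(f"GPU free: {mem.free / 1024**3:.2f} GiB ({free_frac:.1%} of total)")
    if free_frac < MIN_FREE_FRACTION:
        print("Unexpected GPU memory consumers present; aborting launch.")
        sys.exit(1)
finally:
    pynvml.nvmlShutdown()
```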
## The right value depends on three things
There is no universal answer. The right value depends on:
- What else lives on the GPU. If you have MIG slices, dedicated GPU per pod, or other CUDA workloads sharing the device.
- How long your `max_model_len` is. A bigger context means a single long sequence can pin far more KV cache; you can tolerate less margin.
- How aggressive your scheduler is. vLLM's scheduler pre-allocates KV blocks for in-flight requests. Higher concurrency = more outstanding blocks = less slack to absorb a sudden spike.
A reasonable starting matrix:
| Scenario | Suggested gpu_memory_utilization |
|---|---|
| Dedicated GPU, single tenant, fixed max_model_len | 0.85 |
| Dedicated GPU, mixed traffic (variable prompt lengths) | 0.80 |
| Sharing the device with other CUDA workloads | 0.70 |
| MIG slices (each slice has its own memory) | 0.85 (per slice) |
| Tensor-parallel across N GPUs (NCCL overhead) | 0.80 |
| Rolling deploy on the same GPU | 0.65 (temporarily, during rollout window) |
These are starting points, not answers. Use them as a hypothesis, then run the tuning loop below.
## The tuning procedure
This is the loop I run for every new model + GPU combination. It's mechanical, takes about an hour, and converges to a number you can defend.
1. Start conservative. Set gpu_memory_utilization=0.70. Get the pod healthy and serving.
2. Capture baseline KV cache size. vLLM logs the number of KV cache blocks at startup:
INFO 04-24 14:22:11 worker.py:142] # GPU blocks: 12450, # CPU blocks: 0
Each block holds `block_size` tokens; its size in bytes is 2 (K and V) × num_layers × block_size × num_kv_heads × head_dim × dtype bytes (2 for bf16). Note this number; it's your throughput ceiling.
3. Run a load test that exercises peak prompt length. If your real traffic has prompts up to 4096 tokens with completions up to 1024 tokens and concurrency of 32, generate exactly that. Use vLLM's benchmark script or equivalent.
4. Watch `nvidia-smi --query-gpu=memory.used --format=csv -l 1` during the test. Record peak memory (a scripted version of this watcher is sketched after the list).
5. Compute headroom. headroom = total_gpu_memory - peak_memory_during_test. You want at least 2 GiB of headroom for an H100, 1 GiB for an A10/L4, more if you share the GPU with anything.
6. Bump gpu_memory_utilization up by 0.05 if headroom is bigger than your target. Re-run the load test. Confirm KV blocks went up (more throughput) and headroom is still acceptable.
7. Stop when headroom equals target, or when you OOM. If you OOM, drop back 0.05 and call it done.
8. Run a 24-hour soak at production traffic shape. This catches the rare-but-real long-tail prompts that no synthetic test reproduces.
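For steps 4 and 5, here's the scripted watcher referenced above: it polls NVML once a second while your load test runs and reports peak usage and headroom when you stop it. A sketch that assumes a single GPU at index 0 and the 2 GiB H100 target from step 5:

```python
# Poll GPU memory during a load test; report peak usage and headroom
# (steps 4-5 above). Equivalent to eyeballing `nvidia-smi -l 1`, but it
# remembers the peak for you. Assumption: one GPU at index 0.
import time
import pynvml

POLL_SECONDS = 1
TARGET_HEADROOM_GIB = 2.0  # H100 target from step 5; ~1 GiB on A10/L4

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
total = pynvml.nvmlDeviceGetMemoryInfo(handle).total
peak = 0

try:
    while True:  # Ctrl-C when the load test finishes
        used = pynvml.nvmlDeviceGetMemoryInfo(handle).used
        peak = max(peak, used)
        time.sleep(POLL_SECONDS)
except KeyboardInterrupt:
    headroom_gib = (total - peak) / 1024**3
    print(f"Peak GPU memory: {peak / 1024**3:.2f} GiB")
    print(f"Headroom:        {headroom_gib:.2f} GiB "
          f"({'OK' if headroom_gib >= TARGET_HEADROOM_GIB else 'too tight'})")
finally:
    pynvml.nvmlShutdown()
```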
That loop, run for every new (model, GPU, max_model_len) tuple, is the same procedure we walk through hands-on in Production LLM Inference on Kubernetes, with real H100 numbers and full benchmark configs.
A team I worked with ran Llama-3-70B on 4× H100s (TP=4) at gpu_memory_utilization=0.92. Worked fine in staging, worked fine in canary, OOMed in prod after exactly 6 hours every time. The culprit: a Datadog GPU-monitoring agent had been added to the prod node pool but not to staging. The agent's CUDA context, about 400 MiB per GPU, was just enough to push peak memory over the line on one GPU during a long-prompt batch. Fix: drop to 0.88, document the agent's overhead, set up a metric for it.
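That last fix (a metric for other processes' GPU memory) is easy to sketch with NVML's process enumeration. This version just prints; wiring it into your metrics stack is up to you, and `usedGpuMemory` can come back `None` on some driver and container combinations:

```python
# List every CUDA compute process on the GPU and its memory footprint,
# so a new agent's context shows up as a metric instead of a 3 a.m. OOM.
# Sketch: assumes a single GPU at index 0.
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
        used_mib = (proc.usedGpuMemory or 0) / 1024**2
        print(f"pid={proc.pid} gpu_mem={used_mib:.0f} MiB")
finally:
    pynvml.nvmlShutdown()
```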
## What the wrong value costs you
If gpu_memory_utilization is too low, you're leaving throughput on the table. Specifically:
- Smaller KV cache → fewer concurrent sequences → lower max QPS.
- More frequent KV preemption (vLLM's mechanism for evicting in-flight sequences when blocks run out) → tail latency spikes.
If it's too high, you OOM. Cost of an OOM in production:
- Pod restart: ~60-90 seconds for a 7B model, several minutes for a 70B model.
- During that window, all in-flight requests fail, your circuit breaker (you have one, right?) opens, and HPA may scale you up further, compounding the problem.
- For the rest of the day, on-call is twitchy and over-provisions everything.
The financial side of getting this right at scale (KV cache sizing, MIG vs. dedicated GPUs, cost attribution) is what we cover in GPU Cost Optimization on Kubernetes, with a real case study of cutting GPU spend from $180K to $67K per month without dropping serving capacity.
## A few more practical notes
- `swap_space` lets vLLM spill KV blocks to CPU memory when the GPU pool is full. This avoids OOM but adds latency. Useful as a safety valve, not a primary strategy.
- `max_num_batched_tokens` is the upper bound on concurrent token throughput. If it's set too high relative to the KV cache, you'll see preemption even at low real concurrency.
- `enable_chunked_prefill` changes the memory profile significantly: long prompts no longer require their full KV allocation up-front. Re-tune after enabling it.
- `enforce_eager=True` disables CUDA graphs and lowers peak memory by ~1 GiB at the cost of throughput. Useful for getting tight on a small GPU; usually a bad trade on H100s.
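For reference, here's where all of these knobs live in vLLM's offline Python API. The kwarg names match vLLM's `EngineArgs` at the time of writing, but verify against your installed version; the values are illustrative, taken from the examples in this post, not recommendations for your model:

```python
# Where this post's knobs live in vLLM's Python API. Values are
# illustrative (from the examples above), not recommendations.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    gpu_memory_utilization=0.80,   # picked via the tuning loop, not the 0.9 default
    max_model_len=4096,            # bound it explicitly; don't inherit the checkpoint max
    max_num_batched_tokens=8192,   # keep proportionate to the KV cache pool
    swap_space=4,                  # GiB of CPU spill-over; safety valve, not a strategy
    enable_chunked_prefill=True,   # changes the memory profile; re-run the tuning loop
    enforce_eager=False,           # True drops CUDA graphs: ~1 GiB less peak, less throughput
)
```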
If you're running vLLM in production and want a steady stream of this kind of operational depth, the Kubenatives newsletter is where I publish field notes weekly, read by ~3,500 engineers running production K8s and ML infra.
The 0.9 default is fine for a laptop demo. For production on shared infrastructure with real traffic and rolling deploys, you almost certainly want something between 0.75 and 0.85, picked by running the loop above. Don't trust the default. Don't guess. Measure.