
Your 8B Model Won't Fit on an A100 With 50GB Free. Welcome to GPU Memory Fragmentation.

The model weights are 16GB. The KV cache is 20GB. The A100 has 80GB. nvidia-smi shows 50GB free. The next request OOMs. The CUDA memory allocator's fragmentation story most ML engineers never learn.

By Sharon Sahadevan · 11 min read

You deploy Llama-3.1-8B for inference on an A100 with 80GB of HBM. Weights in fp16 are 16GB. Activations and KV cache for a batch of 8 long requests at sequence length 8K should be ~20GB. Total budget: ~36GB. You have 80GB. You expect a comfortable 2x of headroom.

In production, requests start failing with:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.04 GiB.
GPU 0 has a total capacity of 79.15 GiB of which 1.42 GiB is free.
Including non-PyTorch memory, this process has 77.72 GiB memory in use.
Of the allocated memory 67.10 GiB is allocated by PyTorch,
and 9.50 GiB is reserved by PyTorch but unallocated.

nvidia-smi shows the process holding nearly 78GB, with less than 2GB free at the driver level. Your math says you need 36GB. PyTorch tried to allocate 2GB while sitting on 9.5GB it has reserved but not allocated, and the allocation still failed.

Welcome to GPU memory fragmentation. The free memory is real, but it is split into pieces too small for any single allocation. The 9.5GB "reserved by PyTorch but unallocated" is the smoking gun.

This post lays out the GPU memory model: what nvidia-smi actually shows, why PyTorch's caching allocator creates fragmentation, the special pain of the LLM KV cache, and how PagedAttention (vLLM, SGLang) makes the problem mostly go away.

What nvidia-smi actually shows#

When you check GPU memory:

$ nvidia-smi
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    23456      C   /usr/bin/python3                            77.7GiB |
+---------------------------------------------------------------------------------------+

The 77.7GiB is what the CUDA driver reports as held by your process: everything obtained via cudaMalloc, plus the CUDA context's own overhead. Three things to know about this number:

1. It is the running total of cudaMalloc calls minus cudaFree calls. The CUDA driver tracks allocations per process at the page level. When you free, the driver returns the pages to the GPU's free pool.

2. PyTorch's caching allocator does not call cudaFree when you del a tensor. Instead, PyTorch keeps the freed memory in its own pool, ready for the next allocation. From CUDA's perspective (and nvidia-smi's), that memory is still "in use" by your process.

3. "Used" memory is at the granularity of a CUDA memory page, typically 2MB. A 1KB allocation reserves a full 2MB chunk.

When PyTorch reports 67.10 GiB allocated by PyTorch, and 9.50 GiB reserved by PyTorch but unallocated, the breakdown is:

  • 67.10 GiB: tensors that exist and have data.
  • 9.50 GiB: cached blocks that PyTorch released back to its pool but did not return to CUDA. These are reusable for new tensors of compatible sizes.
  • The 9.50 GiB shows in nvidia-smi as part of your process's memory.

The 9.50 GiB is "free" in the sense that PyTorch could use it, but only for allocations that fit in the cached blocks. A 2GB allocation can use a 2GB cached block but not a thousand 2MB cached blocks. This is the fragmentation problem.
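
Here is a minimal sketch of that reuse rule on a live GPU (tensor sizes are illustrative). A freed block is reused for a same-size request, but a larger request cannot be served from it, so PyTorch calls cudaMalloc again and reserved memory grows:

import torch

def reserved_gib():
    return torch.cuda.memory_reserved() / 2**30

x = torch.empty(128 * 1024 * 1024, device="cuda")   # ~0.5 GiB of fp32
del x                                                # returned to PyTorch's pool, not to CUDA
print(reserved_gib())                                # still ~0.5 GiB reserved

y = torch.empty(128 * 1024 * 1024, device="cuda")   # same size: reuses the cached block
print(reserved_gib())                                # unchanged
del y

z = torch.empty(512 * 1024 * 1024, device="cuda")   # ~2 GiB: the 0.5 GiB cached block cannot serve it
print(reserved_gib())                                # grows by ~2 GiB via a fresh cudaMalloc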

How fragmentation happens in LLM inference#

Imagine the lifecycle of an LLM serving worker:

  1. Startup: load model weights. 16GB allocated in a handful of large, long-lived blocks. PyTorch keeps the underlying CUDA pages for the life of the process.

  2. First request: allocate KV cache for sequence length 4K, batch size 1. ~512MB allocated. Forward pass runs; intermediate activations of various sizes flow through. Each layer creates and frees activation tensors of different shapes.

  3. Request finishes: KV cache freed (back to PyTorch pool, not CUDA). Activations all freed. Pool now has many cached blocks of various sizes.

  4. Second request, longer: needs a 2GB KV cache. PyTorch checks the pool. There are many cached blocks but the largest contiguous one is 800MB (because the pool is fragmented from the first request's varied activations). PyTorch calls cudaMalloc for a fresh 2GB block. Total CUDA-visible usage grows.

  5. Hundreds of requests later: total CUDA usage is 70GB. PyTorch's pool has 9GB cached, but the largest contiguous block is 1.8GB. A 2GB allocation fails even though "free" memory is plentiful.

The fragmentation arises because:

  • LLM KV caches are large, contiguous, and per-request.
  • Activations vary in size (sequence length, batch size).
  • The allocator does not compact: once a chunk is allocated, it stays where it is.

Compare this to a conventional model-serving workload where allocations are uniform and recurring (the same batch shape every request): minimal fragmentation, the cache stabilizes after warmup, and PyTorch's pool is efficient. LLM inference is the worst case for a caching allocator because every request can be a different size.
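
You can watch this happen on any CUDA GPU with a small sketch (sizes and counts are arbitrary) that churns through varied-size allocations the way mixed-length requests do, then compares what PyTorch is using against what it is holding:

import torch

def report(tag):
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"{tag}: allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")

report("startup")

# Simulate mixed-length requests: tensors of varied sizes, created and freed
# in interleaved order, like per-request KV caches and activations.
live = []
for step in range(60):
    size_mb = (step * 37) % 300 + 10                       # varied sizes, 10-309 MB
    live.append(torch.empty(size_mb * 1024 * 256, device="cuda"))
    if len(live) > 16:
        live.pop(0)                                        # retire the oldest "request"

live.clear()
report("after churn")   # allocated drops back near zero; reserved stays high: the cached, fragmented pool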

The KV cache: where most of the fragmentation comes from#

For a transformer with $L$ layers, $H$ KV heads, head dimension $d_h$, in fp16, with sequence length $T$, the KV cache size per request is:

2 * L * H * d_h * T * 2 bytes

For Llama-3.1-8B (32 layers, 8 KV heads, 128 head dim, fp16):

2 * 32 * 8 * 128 * T * 2 = 131,072 * T bytes ≈ 128 KiB per token

A 4K-token request: 512 MiB. An 8K-token request: 1 GiB. A 32K-token request: 4 GiB.
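
As a sanity check, here is the same arithmetic as a small helper (dimensions for Llama-3.1-8B as above; substitute any other model's values):

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    # 2 accounts for the separate K and V tensors per layer
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# Llama-3.1-8B: 32 layers, 8 KV heads, head dim 128, fp16
print(kv_cache_bytes(32, 8, 128, 1) / 1024)        # 128.0 KiB per token
print(kv_cache_bytes(32, 8, 128, 8192) / 2**30)    # 1.0 GiB for an 8K-token request
print(kv_cache_bytes(32, 8, 128, 32768) / 2**30)   # 4.0 GiB for a 32K-token request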

For a serving worker handling 16 concurrent requests with mixed sequence lengths, KV cache totals are typically 8-32 GiB and they are constantly being allocated and freed as requests come and go.

This is the dominant source of fragmentation in LLM inference workloads. The model weights are static (load once, keep forever). The activations are short-lived and small enough to live in the allocator's pool without much harm. The KV cache is the right size and shape to fragment everything: large, contiguous, request-lifetime allocations of varying size.

The fix that changed everything: PagedAttention#

vLLM introduced PagedAttention, and SGLang and most modern LLM inference engines have since adopted it. It solves the KV cache fragmentation problem with the same idea operating systems use for virtual memory: split the cache into fixed-size pages and use a page table to look up which pages belong to which request.

Concretely:

  • Allocate one giant block of GPU memory at startup (e.g., 50GB out of 80GB) as the "KV cache pool."
  • Divide it into fixed-size pages (typically 16 tokens worth of KV per page).
  • Maintain a page table per request mapping logical token positions to physical page indices.
  • When a request needs a new token's KV, allocate one free page from the pool.
  • When a request finishes, return its pages to the pool.

The page table costs a few bytes per token (compared to the KV's 128KB per token, negligible). The pages can be reused arbitrarily because they are all the same size and aligned. No fragmentation: the pool's free list is a list of free page indices, not a free-block list with size constraints.
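
In toy form, the bookkeeping looks roughly like the sketch below. Real engines keep these tables in GPU memory and fuse the indirection into the attention kernel; the class and method names here are made up for illustration:

class PagedKVPool:
    """Fixed-size pages, a free list of page indices, one page table per request."""

    def __init__(self, num_pages: int, tokens_per_page: int = 16):
        self.free_pages = list(range(num_pages))
        self.tokens_per_page = tokens_per_page
        self.page_tables: dict[str, list[int]] = {}   # request_id -> physical page indices

    def page_for_token(self, request_id: str, token_index: int) -> int:
        table = self.page_tables.setdefault(request_id, [])
        if token_index // self.tokens_per_page >= len(table):
            if not self.free_pages:
                raise MemoryError("KV pool exhausted")   # a real engine preempts or queues instead
            table.append(self.free_pages.pop())          # any free page works: they are all the same size
        return table[token_index // self.tokens_per_page]

    def release(self, request_id: str) -> None:
        # A finished request's pages go straight back on the free list.
        self.free_pages.extend(self.page_tables.pop(request_id, []))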

For LLM serving, PagedAttention recovers ~30-50% of GPU memory that fragmentation otherwise wastes. A worker that could only handle 8 concurrent requests with the naive allocator can handle 16-20 with PagedAttention.

# vLLM example
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.90,   # use 90% of GPU memory
    max_model_len=8192,
)

gpu_memory_utilization is meaningful because the engine knows how much of the GPU it controls. The remaining 10% is for activations, working memory, and CUDA's own overhead.

Other techniques that reduce fragmentation#

PagedAttention is the heavyweight solution. There are lighter-weight techniques that help:

1. Set PyTorch's max-split-size. This tells the caching allocator not to split blocks above a certain size, reducing fragmentation in mixed-size workloads:

import os

# Must be set before the process makes its first CUDA allocation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

2. Use PyTorch's expandable segments mode. PyTorch 2.1+ supports expandable_segments:True, which grows existing memory segments in place (using CUDA's virtual memory APIs) instead of allocating new fixed-size ones, so it resists fragmentation:

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

3. Periodically empty the cache. torch.cuda.empty_cache() returns cached blocks to CUDA. This can compact the free pool. Use sparingly because the next allocation pays a cudaMalloc cost.

4. Pin batch shapes. Run inference at fixed batch and sequence sizes when possible. Eliminates the variability that causes fragmentation. Used in some training and bulk-inference workflows; impractical for general serving.
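
In toy form, pinning shapes can look like padding every request up to the nearest of a few fixed bucket lengths, so the allocator only ever sees a handful of recurring sizes (the bucket values here are arbitrary):

import torch
import torch.nn.functional as F

BUCKETS = (1024, 2048, 4096, 8192)   # the only sequence lengths the model will ever see

def pad_to_bucket(input_ids: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    seq_len = input_ids.shape[-1]
    bucket = next(b for b in BUCKETS if b >= seq_len)       # raises StopIteration if too long
    return F.pad(input_ids, (0, bucket - seq_len), value=pad_id)

batch = pad_to_bucket(torch.randint(0, 32000, (1, 3100)))
print(batch.shape)   # torch.Size([1, 4096]): always one of the bucketed shapes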

For training, where memory pressure comes from gradient and activation memory rather than per-request KV cache, the techniques are different (gradient checkpointing, ZeRO partitioning, FlashAttention) but the fragmentation principle is the same.

How to diagnose fragmentation in production#

# In your serving code, expose memory stats
import torch

stats = torch.cuda.memory_stats()
print(f"allocated: {stats['allocated_bytes.all.current'] / 1e9:.2f} GB")
print(f"reserved:  {stats['reserved_bytes.all.current'] / 1e9:.2f} GB")

# Driver-level view (what nvidia-smi sees): free and total device memory
free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"driver free: {free_bytes / 1e9:.2f} GB of {total_bytes / 1e9:.2f} GB total")

# Detailed breakdown of the caching allocator's pools
print(torch.cuda.memory_summary())

The key ratios:

  • allocated / reserved: if this is below 0.7, you have significant fragmentation. PyTorch reserved memory that is not backing live tensors.
  • (reserved - allocated): the cached-but-unused memory. If this stays above a few GB under steady load, fragmentation is biting.

Expose these as Prometheus metrics:

# Fragmentation ratio
gpu_memory_allocated_bytes / gpu_memory_reserved_bytes

# Cached fragmented memory
gpu_memory_reserved_bytes - gpu_memory_allocated_bytes

Alert when the ratio drops below 0.6 or fragmented memory exceeds 10GB. These thresholds catch problems before OOMs.
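
A minimal sketch of exporting those two gauges with prometheus_client; the metric names mirror the expressions above but are an assumption, not a standard, and the update call should be wired into whatever periodic loop the server already runs:

import torch
from prometheus_client import Gauge, start_http_server

gpu_memory_allocated_bytes = Gauge(
    "gpu_memory_allocated_bytes", "Bytes backing live PyTorch tensors")
gpu_memory_reserved_bytes = Gauge(
    "gpu_memory_reserved_bytes", "Bytes held by PyTorch's caching allocator")

def update_gpu_memory_metrics() -> None:
    gpu_memory_allocated_bytes.set(torch.cuda.memory_allocated())
    gpu_memory_reserved_bytes.set(torch.cuda.memory_reserved())

start_http_server(9400)   # scrape endpoint (port is arbitrary); call the update on a timer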

MIG: a different fragmentation answer#

NVIDIA's Multi-Instance GPU (MIG) lets you partition an A100/H100 into smaller virtual GPUs (e.g., 7 instances of ~10GB each on an A100-80GB). Each MIG slice is hardware-isolated; each runs its own PyTorch process with its own allocator.

For LLM workloads, MIG trades flexibility for predictability:

  • Smaller models that fit in a slice run with no fragmentation pressure (each slice is small enough that allocations fit easily).
  • Larger models cannot benefit (they need full GPU).

For multi-tenant or multi-model serving, MIG is the right answer: physical isolation eliminates the noisy-neighbor and shared-allocator problems entirely. For a single large model with many concurrent requests, PagedAttention on a whole GPU is the better fit.

Common mistakes#

1. Trusting nvidia-smi for "free" memory. It shows what the CUDA driver reports, which includes PyTorch's cached-but-unallocated memory. Use torch.cuda.memory_stats() for the truth.

2. Sizing for "weights + KV cache" without overhead. Activations, optimizer state (during training), CUDA's own overhead, fragmentation slack: budget 20% extra.

3. Running multiple PyTorch processes on the same GPU. Each has its own caching allocator. They fight for memory and fragment independently. Use MIG or run one process per GPU.

4. Setting gpu_memory_utilization=0.95 in vLLM. Leaves only 5% for non-pool memory. Out-of-pool allocations (CUDA graphs, intermediate tensors) fail. 0.85-0.90 is typically right.

5. Not testing with realistic request mixes. A benchmark with all 4K-token requests does not exhibit the fragmentation that production (mixed lengths) does. Load tests should include short, medium, and long requests.

6. Forgetting kv_cache_dtype. vLLM supports fp8 KV cache (kv_cache_dtype="fp8"), which halves the KV cache size with minimal quality impact. Same fragmentation, but smaller pages = more requests fit.

7. Restarting the worker as a "fix". A restart returns all of the process's GPU memory to the driver, fixing the immediate OOM but not the underlying issue. The worker becomes flaky. Fix the allocator config or move to PagedAttention.

8. CUDA graphs interacting badly with the cache. CUDA graphs capture allocations. If captured during a fragmented state, the graph holds on to fragmented memory across replay. vLLM's enforce_eager=True disables graphs (slightly slower but predictable).
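
Pulling the vLLM-related items (4, 6, and 8) together, here is a hedged example of how those settings combine; whether fp8 KV cache and eager mode are right for a given deployment depends on the model and the traffic:

from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.90,   # leave ~10% for activations and CUDA overhead (mistake 4)
    kv_cache_dtype="fp8",          # halve the KV bytes per token (mistake 6)
    enforce_eager=True,            # disable CUDA graphs if they pin fragmented memory (mistake 8)
    max_model_len=8192,
)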

Quick reference: the GPU OOM checklist#

1. Verify the symptom:
   - "CUDA out of memory" with X GB available
   - PyTorch reports reserved >> allocated
   - nvidia-smi shows process using most of GPU

2. Check fragmentation:
   torch.cuda.memory_summary()
   - Compare reserved memory vs allocated memory
   - A large gap that persists under steady load means the cache is fragmented

3. For LLM inference: just use vLLM/SGLang.
   - PagedAttention solves this categorically
   - gpu_memory_utilization=0.90 is the standard

4. For training or other workloads:
   - export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
   - Or max_split_size_mb tuned for your workload
   - Pin batch shapes if practical

5. For multi-tenant: use MIG to physically isolate.

6. Right-size:
   - Model weights: known (parameters * dtype size)
   - KV cache: tokens * 2 * L * H * d_h * dtype size
   - Activations: 10-20% extra
   - Fragmentation slack: 10-20% extra (less with PagedAttention)

7. Monitor:
   - allocated/reserved ratio < 0.6 = fragmenting
   - alerts on torch OOM events (count + size requested)

The mental model#

GPU memory is not an unstructured pool. It is allocated in pages by CUDA, then re-pooled by PyTorch, then partitioned by the workload. Each layer can fragment.

For LLM inference, the dominant fragmentation source is the KV cache: large, contiguous, per-request, varying-sized allocations. The naive answer (give every request its own contiguous block) wastes 30-50% of GPU memory. PagedAttention's answer (a single pool, fixed-size pages, indirect addressing) is the operating-systems answer applied to GPU memory, and it is correct.

If you are running LLM inference on PyTorch directly, you are paying the fragmentation tax. Move to vLLM, SGLang, or TensorRT-LLM and you stop. The capacity you recover usually pays for the migration in days.

For non-LLM workloads, fragmentation is less severe but still present. The PyTorch allocator config knobs (expandable_segments, max_split_size_mb) are worth knowing. Most production teams never touch them and quietly leave 20% of their GPUs on the table.


The GPU memory model, CUDA mechanics, and inference engine internals are covered in the Production GPU Infrastructure course. The LLM-specific serving patterns (PagedAttention, continuous batching, prefix caching, speculative decoding) are the spine of the LLM Inference on Kubernetes course.