GPU Infrastructure · 11 min read
Your 8B Model Won't Fit on an A100 With 50GB Free. Welcome to GPU Memory Fragmentation.
The model weights are 16GB. The KV cache is 20GB. The A100 has 80GB. nvidia-smi shows 50GB free. Yet the next request OOMs. This is the CUDA memory allocator's fragmentation story, one most ML engineers never learn.
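The paradox in those numbers can be sketched with a toy free-list (a simplification, not the real CUDA caching allocator): plenty of memory is free in total, but live tensors interleaved through the heap mean no single contiguous run is large enough for the next allocation. The heap layout and block sizes below are hypothetical, chosen only to mirror the teaser's arithmetic.

```python
# Toy model of GPU memory fragmentation. A heap is a contiguous sequence
# of (size_gb, in_use) blocks; an allocation needs one contiguous free run.

def free_stats(blocks):
    """Return (total_free_gb, largest_contiguous_free_gb) for a heap."""
    total = largest = run = 0
    for size, in_use in blocks:
        if in_use:
            run = 0              # a live block breaks the free run
        else:
            total += size
            run += size
            largest = max(largest, run)
    return total, largest

# Hypothetical 80GB card after many alloc/free cycles: 50GB free in
# total, but live tensors cap the largest contiguous free run at 20GB.
heap = [(16, True), (20, False), (20, True), (20, False), (4, True), (10, False)]

total_free, largest_run = free_stats(heap)
print(total_free, largest_run)    # 50 20
request = 24                      # next allocation, in GB
print(request <= largest_run)     # False: OOM despite 50GB "free"
```

The tooling reports `total_free` (what nvidia-smi-style accounting shows), but an allocation succeeds or fails on `largest_run`, which is why "50GB free" and "out of memory" are not a contradiction.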