Production GPU Infrastructure on Kubernetes

Why GPUs Are Different

Your team just got approval to run ML inference workloads on Kubernetes. The ML engineers hand you a Docker image and say, "We need 4 A100s per pod, 80GB each."

You've managed Kubernetes clusters for years. You've scheduled thousands of CPU pods. But you've never touched GPUs.

Before you write a single line of YAML, you need a mental model of what you're actually managing. Because when a GPU pod gets stuck in Pending, or crashes with CUDA error: out of memory, or runs 10x slower than expected, the answer isn't in the Kubernetes docs. It's in the hardware.

Part 1: CPU vs GPU: Why the Difference Matters for Scheduling

You already understand CPUs from a Kubernetes perspective. A node has cores, you set requests and limits, the scheduler places pods based on available compute. GPUs seem like they should work the same way. They don't.

The Fundamental Architecture Difference

A CPU is designed for sequential processing. It has a small number of powerful cores (8-128 on modern server CPUs), each capable of running complex, branching logic quickly. A single CPU core can handle an entire web request, run a database query, or manage a Kubernetes controller loop.

A GPU is designed for parallel processing. It has thousands of simpler cores (6,912 CUDA cores on an A100), each capable of performing one arithmetic operation per clock cycle. No single GPU core can run a web server. But all 6,912 of them together can multiply two matrices faster than any CPU on earth.

CPU vs GPU Architecture

Property	CPU (e.g. EPYC 9654)	GPU (e.g. A100)
Optimized for	Latency, sequential processing	Throughput, parallel processing
Core count	96 powerful cores	6,912 simple CUDA cores
Core type	Complex, general-purpose	Simple arithmetic units
Cache per core	Large (L1/L2/L3)	Minimal (shared per SM)
Clock speed	High (3-5 GHz)	Lower (1-2 GHz)
Branch prediction	Advanced	Minimal
Memory	DDR5 (~200 GB/s)	HBM2e (~2,039 GB/s)
Best for	Complex serial tasks	Massively parallel math

This distinction shapes how you schedule and manage GPU workloads in Kubernetes:

Property	CPU in Kubernetes	GPU in Kubernetes
Divisible?	Yes, you can request 0.5 CPU	No, minimum 1 whole GPU
Shared?	Yes, multiple pods share cores via time-slicing	No, 1 GPU = 1 pod (by default)
Overcommit?	Yes, requests < limits allows overcommit	No, a GPU is either allocated or not
Compressible?	Yes, CPU-throttled pods still run	No, GPU OOM = immediate crash
Scheduler aware?	Built-in	Requires NVIDIA Device Plugin

KEY CONCEPT

GPUs are non-divisible, non-shared, non-compressible resources in Kubernetes. A pod either gets a full GPU or nothing at all. There is no equivalent of "500m" GPU the way there is for CPU millicores, unless you explicitly configure GPU sharing via MIG partitioning (Module 3) or time-slicing.

This means a single GPU sitting at 5% utilization is wasted capacity that no other pod can use. Unlike CPU, where the kernel time-slices between processes automatically, a GPU allocated to one pod is invisible to all other pods. This is the single most expensive inefficiency in GPU clusters, and it's why Module 3 (MIG Partitioning) exists.
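To make the contrast with CPU requests concrete, here is a minimal Python sketch of the whole-device rule. Both helper names are hypothetical, not Kubernetes APIs. They only mirror the behavior described above: CPU quantities may be fractional, while a GPU quantity has to be a whole number.

```python
def parse_cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '500m' or '0.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def validate_gpu_request(quantity: str) -> int:
    """Return the GPU count, or raise if the quantity is not a whole number."""
    if not quantity.isdigit():
        raise ValueError(f"GPU quantity must be a whole number, got {quantity!r}")
    return int(quantity)


print(parse_cpu_millicores("500m"))  # fractional CPU is accepted
print(validate_gpu_request("1"))     # a whole GPU is accepted
```

Calling validate_gpu_request("0.5") raises ValueError, which is the behavior you should expect from a default cluster with no MIG or time-slicing configured.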

GPUs Break the Kubernetes Resource Model

As a Kubernetes engineer, you're used to thinking in terms of CPU millicores and memory bytes. GPU workloads break this model in several ways.

Memory is on-device. CPU workloads use system RAM. GPU workloads use GPU memory (VRAM), which is physically on the GPU card. An A100 has 40GB or 80GB of HBM2e. Once it's full, your workload doesn't swap to disk; it OOMs and crashes.

# You can ask for a GPU count, but not for GPU memory
resources:
  limits:
    nvidia.com/gpu: 1
    # GPUs go under limits; Kubernetes uses the limit as the request
    # There is no "nvidia.com/gpu-memory: 40Gi" field
    # GPU memory is not a schedulable resource in Kubernetes

WARNING

Kubernetes has no built-in awareness of GPU memory. The scheduler only knows whether a GPU is available or not; it has no idea how much VRAM your workload needs. You must handle GPU memory management at the application level and through careful pod placement.
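If you generate manifests programmatically, the same constraint shows up in the data structure. The sketch below builds a minimal Pod manifest as a Python dict; the function name, pod name, and image are hypothetical placeholders. Note that the only GPU-related field is a device count under limits, and nothing in the spec describes VRAM.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int) -> dict:
    """Build a minimal Pod manifest that asks for whole GPUs via limits."""
    if gpus < 1:
        raise ValueError("request at least one whole GPU")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Only a device count is expressible; VRAM is not.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


# Hypothetical pod name and image, for illustration only.
manifest = gpu_pod_manifest("inference-demo", "example.com/ml/inference:latest", 4)
print(json.dumps(manifest, indent=2))
```

The JSON output can be piped to kubectl apply -f - just like YAML.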

Driver compatibility is critical. CPU workloads don't care which kernel version you're running (usually). GPU workloads have a tight coupling between the NVIDIA driver installed on the host, the CUDA runtime baked into your container image, and the GPU hardware generation (Ampere, Hopper, etc.).

# Check driver version on a GPU node
nvidia-smi
# +-----------------------------------------------------------------------------+
# | NVIDIA-SMI 535.129.03   Driver Version: 535.129.03   CUDA Version: 12.2     |
# +-----------------------------------------------------------------------------+

# If your container needs CUDA 12.3 but the driver only supports 12.2,
# your workload can fail with cryptic errors
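The comparison above is easy to automate as a pre-flight check. This is a simplified sketch: the function names are hypothetical, and it deliberately ignores NVIDIA's minor-version and forward-compatibility mechanisms, so treat a False result as "investigate", not as proof of failure.

```python
import re


def max_cuda_from_smi(header: str) -> tuple:
    """Extract the 'CUDA Version: X.Y' field from nvidia-smi's header line."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", header)
    if match is None:
        raise ValueError("no CUDA Version field found")
    return int(match.group(1)), int(match.group(2))


def runtime_supported(runtime: str, driver_max: tuple) -> bool:
    """True if the container's CUDA runtime is <= the driver's maximum."""
    major, minor = (int(part) for part in runtime.split(".")[:2])
    return (major, minor) <= driver_max


header = "| NVIDIA-SMI 535.129.03   Driver Version: 535.129.03   CUDA Version: 12.2     |"
driver_max = max_cuda_from_smi(header)
print(driver_max)                             # (12, 2)
print(runtime_supported("12.3", driver_max))  # False
print(runtime_supported("11.8", driver_max))  # True
```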

WAR STORY

We once had a fleet of 200 GPU nodes where 30 had a different driver version due to a failed rolling update. Pods would randomly fail on those nodes with CUDA_ERROR_NO_DEVICE. It took us two days to realize the driver mismatch because nvidia-smi showed the GPU as healthy, but the CUDA compatibility matrix didn't match the container runtime. The fix was a forced DaemonSet rollout of the GPU Operator to resync driver versions across the fleet.

Failure modes are different. CPU cores don't usually "go bad." GPUs do. Common GPU failure modes include:

ECC memory errors: correctable ones are fine, uncorrectable ones mean the GPU needs replacement
Thermal throttling: GPUs run hot (up to 83°C under load) and will slow down or shut off
NVLink errors: if you're using multi-GPU nodes, the interconnect between GPUs can fail
Xid errors: NVIDIA's error reporting system that logs hardware and driver issues

# Check for GPU errors
nvidia-smi --query-gpu=ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total --format=csv
# ecc.errors.corrected.volatile.total, ecc.errors.uncorrected.volatile.total
# 0, 0

# Check dmesg for Xid errors
dmesg | grep -i "xid"
# NVRM: Xid (PCI:0000:65:00): 79, pid=12345, GPU has fallen off the bus
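If you want to feed Xid events into alerting before Module 7, the log line is simple to parse. The sketch below assumes the kernel log format shown above; the function names and the CRITICAL_XIDS set are illustrative choices, and the set contains only the code this lesson calls out.

```python
import re

# Matches lines such as: NVRM: Xid (PCI:0000:65:00): 79, pid=12345, ...
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<pci>[^)]+)\): (?P<code>\d+)")

# Only Xid 79 is listed here because it is the one discussed in this lesson.
CRITICAL_XIDS = {79}


def parse_xid(line: str):
    """Return (pci_address, xid_code), or None if the line has no Xid event."""
    match = XID_PATTERN.search(line)
    if match is None:
        return None
    return match.group("pci"), int(match.group("code"))


def should_page(line: str) -> bool:
    """True when the line carries an Xid code we treat as critical."""
    event = parse_xid(line)
    return event is not None and event[1] in CRITICAL_XIDS


sample = "NVRM: Xid (PCI:0000:65:00): 79, pid=12345, GPU has fallen off the bus"
print(parse_xid(sample))    # ('PCI:0000:65:00', 79)
print(should_page(sample))  # True
```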

PRO TIP

Set up alerts for Xid errors in your GPU monitoring stack (Module 7). Xid 79 ("GPU has fallen off the bus") is the one that wakes you up at 3 AM: the host has lost contact with the GPU, which usually points to failing hardware that needs physical replacement. Cordon and drain the node immediately, then power-cycle it.

Practical Impact on Your Infrastructure

Here's the full picture of what changes when you add GPUs to your cluster:

Aspect	CPU Nodes	GPU Nodes
Cost	$0.10-2/hr	$2-30/hr
Bin packing	Dense, efficient	Sparse, wasteful
Node count	Many small nodes	Few large nodes
Failure impact	Low (reschedule)	High ($$$, scarce)
Driver management	Kernel auto-updates	Careful version control
Monitoring	Standard metrics	GPU-specific metrics (DCGM)
Scheduling	Default scheduler	Custom labels + taints

Part 2: Inside the GPU: The Hardware You're Managing

You don't need to write CUDA code. But you need to understand the hardware well enough to diagnose problems, make scheduling decisions, and have informed conversations with ML engineers. Here's what's inside the GPUs you're deploying.

The Streaming Multiprocessor (SM)

The SM is the basic compute unit of a GPU. An A100 has 108 SMs. Each SM contains:

64 CUDA cores: general-purpose arithmetic units
4 Tensor Cores: specialized units for matrix multiplication (this is what makes modern AI fast)
Shared memory / L1 cache: 192 KB per SM
Warp schedulers: manage groups of 32 threads executing in lockstep

NVIDIA A100 GPU, Internal Architecture

Component	Detail
Streaming Multiprocessors (SMs)	108
CUDA Cores	64 per SM
Tensor Cores	4 per SM
Shared Memory / L1	192 KB per SM
Warp Schedulers	4 per SM (32 threads each)
L2 Cache	40 MB
HBM2e Memory	80 GB @ 2,039 GB/s
Totals	6,912 CUDA Cores, 432 Tensor Cores

The totals: 108 SMs × 64 CUDA cores = 6,912 CUDA cores. 108 SMs × 4 Tensor Cores = 432 Tensor Cores. All backed by 80 GB of HBM2e memory with 2,039 GB/s bandwidth, roughly 10x faster than DDR5 server memory.

Why K8s Engineers Care About SMs

You care about SMs because of MIG (Multi-Instance GPU). MIG lets you partition a single A100 into up to 7 isolated GPU instances, each with its own SMs, memory, and cache. Instead of wasting an entire 80GB A100 on a small inference workload, you carve out a 10GB slice.

Each MIG instance gets a proportional share of SMs:

MIG Profile	SMs	VRAM	Use Case
1g.10gb	14	10 GB	Small model inference
2g.20gb	28	20 GB	Medium model inference
3g.40gb	42	40 GB	Large model inference
4g.40gb	56	40 GB	Training, large inference
7g.80gb	98	80 GB	Full GPU (nearly)
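The table translates directly into a lookup you can use when sizing workloads. This sketch picks the smallest profile whose VRAM covers a given requirement. The function name is hypothetical, and the caller is assumed to pass a figure that already includes headroom for KV cache and activations.

```python
# (profile, SMs, VRAM in GB), copied from the table above.
MIG_PROFILES = [
    ("1g.10gb", 14, 10),
    ("2g.20gb", 28, 20),
    ("3g.40gb", 42, 40),
    ("4g.40gb", 56, 40),
    ("7g.80gb", 98, 80),
]


def smallest_profile(vram_needed_gb: float):
    """Return the first (smallest) profile with enough VRAM, or None."""
    for name, _sms, vram_gb in MIG_PROFILES:
        if vram_gb >= vram_needed_gb:
            return name
    return None


print(smallest_profile(3.5))  # 1g.10gb
print(smallest_profile(14))   # 2g.20gb
print(smallest_profile(90))   # None
```

A None result means the workload does not fit on a single A100 80GB at all, which is a multi-GPU conversation rather than a MIG one.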

KEY CONCEPT

MIG partitioning is how you solve the "wasted GPU" problem. A single A100 running a small 7B model at 5% utilization is burning roughly $4-5/hr of compute on its own, and $30+/hr if the whole 8-GPU instance is used that way. With MIG, you can run 7 small inference workloads on one GPU, each fully isolated with guaranteed compute and memory. We cover MIG in detail in Module 3.

The Three Types of Cores

Modern NVIDIA GPUs have three types of cores, and each serves a different purpose:

CUDA Cores (6,912 on A100) General-purpose arithmetic. One CUDA core does one floating-point operation per clock cycle. These handle everything from basic math to graphics rendering. For ML, CUDA cores handle operations that aren't matrix multiplications: activation functions, normalization, data preprocessing.

Tensor Cores (432 on A100) Specialized matrix multiplication units. A single Tensor Core can perform a 4×4 matrix multiply-and-accumulate in one clock cycle, an operation that would take a CUDA core 64 cycles. This is why modern AI training is fast.
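Where does the 64 come from? A 4×4 matrix multiply needs 4 × 4 × 4 multiply-accumulate steps. The plain-Python sketch below counts them one at a time, the way a single scalar unit would have to issue them. It is an illustration of the arithmetic, not a model of real GPU timing.

```python
def matmul_with_mac_count(a, b):
    """Multiply two square matrices and count multiply-accumulate steps."""
    n = len(a)
    out = [[0.0] * n for _ in range(n)]
    macs = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                out[i][j] += a[i][k] * b[k][j]  # one multiply-accumulate
                macs += 1
    return out, macs


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
m = [[float(4 * i + j) for j in range(4)] for i in range(4)]
product, macs = matmul_with_mac_count(m, identity)
print(macs)  # 64
```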

Tensor Cores also support mixed precision: they can take FP16 or BF16 inputs, multiply them, and accumulate in FP32. This is critical because:

FP16 models use half the VRAM of FP32 models
Tensor Cores operate at 2x speed with FP16
The accuracy loss is negligible for most inference workloads
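The VRAM claim is just bytes per parameter. Here is a minimal sketch of the arithmetic, using 1 GB = 10^9 bytes to match the round numbers in this course. The function name is hypothetical, and the result covers weights only, not KV cache or activations.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "BF16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weights_gb(params_billions: float, precision: str) -> float:
    """Weights-only footprint in GB for a model of the given size."""
    return params_billions * BYTES_PER_PARAM[precision]


for precision in ("FP32", "FP16", "INT8", "INT4"):
    print(precision, weights_gb(7, precision))
```

For a 7B model this gives 28 GB in FP32 and 14 GB in FP16, which is the "half the VRAM" in the list above.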

PRO TIP

When your ML team says they're using "mixed precision" or "BF16 inference," they're leveraging Tensor Cores. This is a good thing: it means they're using half the VRAM and getting 2x the throughput. Encourage it. If they're running in FP32, they're leaving performance on the table.

RT Cores (not relevant for ML) Ray tracing cores. You'll only encounter these in gaming/rendering contexts. Ignore them for ML workloads.

CUDA Cores vs Tensor Cores

Property	CUDA Cores	Tensor Cores
Role	General-purpose arithmetic	Matrix multiplication specialists
Count (A100)	6,912	432
Operation	1 FP op per clock cycle	4×4 matmul per clock cycle
Used for	Activations, normalization, preprocessing	Matrix multiplications (the bulk of AI)
Matrix multiply	64 cycles for 4×4 matmul	1 cycle for 4×4 matmul (64x faster)
Precision	FP32, FP64	FP16, BF16, TF32, INT8, FP8 (Hopper+)

The GPU Software Stack

Understanding how the software layers map to containers vs host is critical for debugging:

GPU Software Stack, Container vs Host

Your Application (PyTorch, vLLM, TensorRT)

Lives in the container. This is what your ML engineers build and hand to you as a Docker image.

CUDA Libraries (cuDNN, cuBLAS, NCCL, TensorRT)

Deep learning primitives, linear algebra, multi-GPU communication. Bundled in the container image.

CUDA Runtime API (libcudart.so)

High-level API included in the container image. Version must be ≤ what the host driver supports. This is the version nvcc --version reports when the CUDA toolkit is present in the image.

CUDA Driver API (libcuda.so)

Bind-mounted from the host into the container by nvidia-container-runtime at container creation time. Its version is the driver version shown in nvidia-smi.

NVIDIA Kernel Module (nvidia.ko)

Installed on the host node (or managed by GPU Operator). Talks directly to GPU hardware. This is the driver.

GPU Hardware (A100, H100, L4, T4)

The physical GPU. Each generation has different capabilities (compute capability), memory, and performance characteristics.


The key insight: the CUDA runtime lives in your container, but the driver lives on the host. This split is what makes GPU Kubernetes harder than regular Kubernetes. You need to ensure compatibility across container images (CUDA runtime version), node configuration (driver version), and hardware (GPU generation).

WARNING

The "CUDA Version" shown in nvidia-smi is NOT the CUDA version installed. It's the maximum CUDA runtime version the host driver supports. Your container can use any CUDA version up to that number. This trips up almost everyone the first time they debug a GPU pod.

Part 3: The NVIDIA GPU Lineup: Choosing the Right Hardware

When you're spec'ing a GPU cluster or advising on cloud instance types, you need to know what's available and when to use each.

Data Center GPUs


The Decision Matrix

GPU	VRAM	FP16 TFLOPS	Best For	AWS Instance	Approx Cost/hr (whole instance)
T4	16 GB	65	Small model inference, batch processing	g4dn.xlarge	$0.53
A10G	24 GB	125	Mid-range inference, fine-tuning small models	g5.xlarge	$1.01
L4	24 GB	121	Efficient inference (low power)	g6.xlarge	$0.80
A100 40GB	40 GB	312	Training, large model inference	p4d.24xlarge (8×)	$32.77
A100 80GB	80 GB	312	Large model training, LLM inference	p4de.24xlarge (8×)	$40.97
H100	80 GB	990	LLM training, highest throughput inference	p5.48xlarge (8×)	$98.32

How to Think About GPU Selection

The decision tree for inference workloads:

Model size (in deployment precision)?
│
├── < 14 GB → T4 (16 GB, cheapest)
│    └── Example: 7B model in INT4 (3.5 GB)
│    └── Example: ResNet-50 in FP16 (100 MB)
│
├── 14–22 GB → A10G or L4 (24 GB)
│    └── Example: 7B model in FP16 (14 GB) + KV cache
│    └── Example: 13B model in INT8 (13 GB) + small batch
│
├── 22–38 GB → A100 40GB
│    └── Example: 13B model in FP16 (26 GB) + KV cache
│
├── 38–75 GB → A100 80GB (single GPU)
│    └── Example: 70B model in INT4 (35 GB) + KV cache
│
└── > 75 GB → Multi-GPU (tensor parallelism)
     └── Example: 70B in FP16 (140 GB) → 2× A100 80GB
     └── Example: 70B in INT8 (70 GB) + KV cache → 2× A100 80GB
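The tree above is a threshold lookup, so it fits in a few lines. In this sketch the function name is hypothetical, the input is the total deployment footprint in GB (weights plus whatever KV cache headroom you expect), and a value of exactly 75 GB is treated as multi-GPU, a boundary the tree itself leaves open.

```python
# (exclusive upper bound in GB, GPU choice), following the decision tree.
TIERS = [
    (14.0, "T4"),
    (22.0, "A10G or L4"),
    (38.0, "A100 40GB"),
    (75.0, "A100 80GB"),
]


def pick_gpu(footprint_gb: float) -> str:
    """Map a deployment footprint to the smallest GPU tier that holds it."""
    for upper_bound, gpu in TIERS:
        if footprint_gb < upper_bound:
            return gpu
    return "Multi-GPU (tensor parallelism)"


print(pick_gpu(3.5))   # T4
print(pick_gpu(14))    # A10G or L4
print(pick_gpu(140))   # Multi-GPU (tensor parallelism)
```

Note that 70B in INT4 is 35 GB of weights, which lands in the A100 40GB tier on its own; it is the KV cache on top that pushes it to the 80GB card in the tree.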

KEY CONCEPT

For training workloads, the calculation changes significantly: training requires 3-4x more memory than inference (optimizer states, gradients, activations) and benefits heavily from fast interconnect (NVLink, NVSwitch). A model that infers on a single A100 may need 4-8 GPUs for training.

Cloud Instance Gotchas

Things that will bite you in production:

1. Not all A100s are equal. AWS p4d instances have A100 40GB. AWS p4de instances have A100 80GB. The e suffix doubles the memory. If your ML team tested on an 80GB A100 locally and you deployed to p4d (40GB), the model won't fit.

2. GPU networking varies by instance. p4d.24xlarge gives you 8× A100 with NVSwitch (all-to-all 600 GB/s). If you use 8 separate g5.xlarge instances instead (1 A10G each), inter-GPU communication goes over the network, 10-100x slower. This matters enormously for tensor parallelism.

3. EBS bandwidth is a bottleneck. Model weights load from disk to GPU memory at startup. A 140 GB model on a gp3 EBS volume (125 MB/s default) takes 19 minutes to load. On io2 with 4 GB/s provisioned, it takes 35 seconds. If your pods take minutes to start, check disk throughput before blaming the GPU.

# Check EBS throughput during model load
$ iostat -x 1
Device  rrqm/s  wrqm/s    r/s    w/s   rMB/s   wMB/s  await  %util
nvme0n1   0.00    0.00  960.0    0.0   120.0     0.0   1.04  100.0
#                                       ^^^^^
#                              Only 120 MB/s — EBS throttled
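The load-time numbers in this gotcha are plain division: size over throughput. A small sketch, using 1 GB = 1,000 MB and a hypothetical function name, reproduces them so you can plug in your own model size and volume settings. It ignores filesystem caching and deserialization overhead, so real startups will be somewhat slower.

```python
def load_seconds(model_gb: float, throughput_mb_s: float) -> float:
    """Best-case seconds to read a model from disk at a sustained throughput."""
    return model_gb * 1000 / throughput_mb_s


print(load_seconds(140, 125))   # 1120.0 seconds, about 19 minutes
print(load_seconds(140, 4000))  # 35.0 seconds
```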

WAR STORY

A team spent a week trying to figure out why their inference pods took 12 minutes to become ready. They suspected GPU driver issues, container image size, Python import times, everything except storage. The model was 70GB in INT4 and the gp3 volume was provisioned at the default 125 MB/s. Switching to an io2 volume provisioned for 3,000 MB/s of throughput brought startup to under a minute. Always check iostat during model load.

4. Spot instances terminate on short notice. GPU spot instances are 60-70% cheaper but can be reclaimed with only 2 minutes' notice. This is fine for training with checkpoints. It's not fine for production inference serving real users.

PRO TIP

Use spot instances for training and batch inference. Use on-demand or reserved instances for real-time inference endpoints. The cost savings from spot are substantial (60-70%) but the reliability trade-off is only acceptable when your workload can tolerate interruption.
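"Can tolerate interruption" in practice means the training loop notices a shutdown request and stops at a step it can resume from. The sketch below shows that shape in plain Python. Everything here is an assumption for illustration: the class and function names are hypothetical, the training step is a placeholder, and it presumes something in your cluster drains the node on a spot notice so that the pod receives SIGTERM.

```python
import signal


class StopFlag:
    """Records that a shutdown signal arrived; checked between training steps."""

    def __init__(self) -> None:
        self.requested = False

    def request(self, signum=None, frame=None) -> None:
        self.requested = True


def install_sigterm_handler(stop: StopFlag) -> None:
    """Route SIGTERM to the stop flag. Call this from the main thread."""
    signal.signal(signal.SIGTERM, stop.request)


def train(total_steps: int, stop: StopFlag, start_step: int = 0) -> int:
    """Run until done or interrupted; return the step to checkpoint."""
    step = start_step
    while step < total_steps and not stop.requested:
        step += 1  # placeholder for one real training step
    return step  # the caller persists this step along with model state


stop = StopFlag()
print(train(total_steps=5, stop=stop))  # 5
```

On restart, pass the saved step back in as start_step so the job resumes instead of starting over.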

Investigation Steps: Your First GPU Debugging Checklist

When a GPU workload isn't working, work through these steps in order:

# Step 1: Is the GPU visible to the node?
nvidia-smi
# → If this fails: driver not installed, or GPU hardware issue

# Step 2: Is the device plugin running on this node?
# (namespace and label vary by install method; adjust to match your cluster)
kubectl get pods -n kube-system -l app=nvidia-device-plugin -o wide
# → If not running: GPU Operator issue, or DaemonSet not scheduled to this node

# Step 3: Does Kubernetes know about the GPUs?
kubectl describe node <node> | grep nvidia.com/gpu
# Allocatable: nvidia.com/gpu: 8
# Allocated:   nvidia.com/gpu: 3
# → If 0: device plugin can't discover GPUs

# Step 4: Is the pod requesting GPUs?
kubectl get pod <pod> -o yaml | grep nvidia
# → Must have nvidia.com/gpu in limits (a request without a limit is rejected; if both are set they must be equal)

# Step 5: Can the container see the GPU?
kubectl exec <pod> -- nvidia-smi
# → If fails: container runtime not configured for GPU access

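# Step 6: Is the CUDA version compatible?
kubectl exec <pod> -- nvcc --version
# → If nvcc is missing (runtime-only images), PyTorch images can report it instead:
#   kubectl exec <pod> -- python -c "import torch; print(torch.version.cuda)"
# → Compare with host driver version (driver must be ≥ runtime)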

# Step 7: Does the model fit in memory?
kubectl exec <pod> -- python -c \
  "import torch; print(f'{torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')"
# → Compare with model size estimate
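Step 3 is the one worth scripting first, because it is the same check on every node. The sketch below reads the JSON that kubectl get node <node> -o json produces and returns the allocatable GPU count. The function name is hypothetical and the sample document is trimmed to the one field that matters here.

```python
import json


def allocatable_gpus(node: dict) -> int:
    """Return the node's allocatable nvidia.com/gpu count, or 0 if absent."""
    allocatable = node.get("status", {}).get("allocatable", {})
    return int(allocatable.get("nvidia.com/gpu", "0"))


# Trimmed sample of `kubectl get node <node> -o json` output.
sample_node = json.loads(
    '{"status": {"allocatable": {"cpu": "96", "nvidia.com/gpu": "8"}}}'
)
print(allocatable_gpus(sample_node))  # 8
```

A result of 0 on a node that physically has GPUs points back to Step 2: the device plugin is not advertising them.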

PRO TIP

Print this checklist. Tape it to your monitor. The first 50 GPU issues you debug will be one of these 7 steps. After that, you'll have the muscle memory to skip straight to the likely cause.

Key Concepts Summary

GPUs are non-divisible, non-shared, non-compressible resources in Kubernetes by default, which makes them fundamentally different from CPU scheduling
GPU memory (VRAM) is a hard wall with no swap and no overcommit: if the workload doesn't fit, it crashes
The CUDA driver lives on the host and the runtime lives in the container; the driver must support a CUDA version ≥ the runtime version
nvidia-smi's "CUDA Version" is misleading: it shows the max supported runtime, not what's installed
SMs (Streaming Multiprocessors) are the compute units that MIG divides; understanding SMs lets you reason about MIG partitioning
Tensor Cores handle matrix multiplications at 64x the speed of CUDA cores; mixed precision (FP16/BF16) leverages them for 2x throughput at half the memory
MIG lets you carve one GPU into up to 7 isolated instances, which is the solution for wasted GPU capacity on small inference workloads
ECC uncorrectable errors = hardware degradation = replace the GPU
Power draw is one of the fastest signals for "is the GPU actually working": 80W during heavy inference means the bottleneck is upstream

Common Mistakes

Confusing nvidia-smi's "CUDA Version" (max supported) with the actually installed CUDA runtime (they're different things)
Running mixed GPU types (T4 + A100) without setting node selectors, causing pods to schedule on incompatible hardware
Assuming GPU memory works like CPU memory: there is no swap, no overcommit, no graceful degradation
Ignoring EBS/disk throughput when debugging slow pod startup: model loading from slow storage adds minutes
Deploying to p4d (A100 40GB) when the model was tested on A100 80GB: the e suffix matters
Using 8 separate single-GPU instances instead of one 8-GPU instance for multi-GPU workloads: inter-GPU communication over the network is 10-100x slower than NVLink
Seeing 0% GPU utilization and assuming the GPU is broken (it might just be between inference requests, check over a longer sample window)

Interview Questions

These are the kinds of questions senior GPU infrastructure engineers get asked. Use them to test your understanding.

Q: Your ML team says they need to deploy a 70B parameter model for inference. Walk me through how you'd determine the GPU requirements.

Think about: precision choice → VRAM calculation (parameters × bytes per precision) → KV cache impact at target batch size → single vs multi-GPU → instance type selection → cost trade-offs between A100 80GB and 2× A100 40GB.

Q: A GPU pod works on node-A but crashes on node-B with "CUDA error: no kernel image is available for execution on the device." Both nodes have GPUs. What's happening?

Think about: compute capability mismatch → different GPU models across nodes (e.g., T4 vs A100) → container built with TORCH_CUDA_ARCH_LIST targeting only one architecture → fix with multi-arch build or node labels and scheduling constraints.
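A quick way to reason about this answer is to compare the node's compute capability with the architectures the image was built for. The sketch below uses NVIDIA's published compute capabilities for the GPUs in this lesson. The function name is hypothetical, and the check is deliberately conservative: it ignores the forward compatibility that a +PTX entry can provide through JIT compilation.

```python
# Compute capability per GPU model, as published by NVIDIA.
COMPUTE_CAPABILITY = {
    "T4": "7.5",
    "A100": "8.0",
    "A10G": "8.6",
    "L4": "8.9",
    "H100": "9.0",
}


def image_targets_gpu(gpu: str, arch_list: str) -> bool:
    """True if a TORCH_CUDA_ARCH_LIST-style string names this GPU's architecture."""
    entries = arch_list.replace(";", " ").split()
    built = {entry.replace("+PTX", "") for entry in entries}
    return COMPUTE_CAPABILITY[gpu] in built


print(image_targets_gpu("A100", "8.0"))     # True
print(image_targets_gpu("T4", "8.0"))       # False
print(image_targets_gpu("T4", "7.5;8.0"))   # True
```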

Q: You have a cluster with A100 80GB GPUs running inference for a 7B model in FP16. GPU utilization hovers at 5%. What would you do to improve cost efficiency?

Think about: MIG partitioning to split one A100 into multiple instances → each running a separate model or replica → 14 GB of FP16 weights needs at least a 2g.20gb slice, so up to 3 replicas per GPU (up to 7 if the model is quantized to fit a 1g.10gb slice) → or downsizing to a cheaper GPU (T4/A10G) if the model fits.

What's Next

In the next lesson, we'll dive into the NVIDIA driver stack, the software that bridges your Kubernetes cluster to the GPU hardware. You'll learn about the different driver types (datacenter vs consumer), how the GPU Operator manages driver lifecycle across your fleet, and how to avoid the driver compatibility nightmares that plague every GPU cluster at scale.

KNOWLEDGE CHECK

In Kubernetes, how are GPU resources allocated to pods by default?
