The Three Forms of Waste
Knowing what a GPU costs is half of optimization. The other half is knowing where money is being spent that should not be. In every cluster I have audited, the same three patterns of waste account for the bulk of unnecessary GPU spend. Each one is concrete, measurable, and fixable, but only after you know to look for it.
This lesson is the diagnostic: how to measure each form of waste in your cluster, what the typical numbers look like, and which one is usually the biggest.
The three forms of GPU waste are: wrong GPU type (using an A100 when an A10G would do), no autoscaling (paying for capacity you do not use), and non-GPU pods on GPU nodes (CPU-bound workloads consuming the host while the GPU sits idle). The biggest by spend is usually #2; the easiest to fix is usually #3; the highest leverage is usually #1. Knowing the breakdown for your cluster is the first move.
Form 1: wrong GPU type
The pattern: a workload that needs a T4 ($0.40/hr) is running on an A100 ($4/hr). Same throughput; 10x the cost.
This happens for boring reasons:
- The team copied a deployment that someone else wrote.
- The team picked the largest GPU because "we're a serious ML shop."
- The team needed an A100 for one specific model and reused the same node pool for everything else.
- Nobody benchmarked the workload on smaller GPUs.
How to measure it
Look at GPU utilization (DCGM_FI_DEV_GPU_UTIL from dcgm-exporter). If a workload's GPU is running at under 30 percent utilization on average, there is a strong chance a smaller GPU would handle the same load.
Three signals together:
# 1. Average GPU utilization per workload
avg by (pod) (
  avg_over_time(DCGM_FI_DEV_GPU_UTIL{exported_namespace="prod-inference"}[1d])
)
# 2. Peak GPU memory usage
max by (pod) (
  max_over_time(DCGM_FI_DEV_FB_USED{exported_namespace="prod-inference"}[1d])
)
# 3. Peak GPU SM (streaming multiprocessor) utilization
max by (pod) (
  max_over_time(DCGM_FI_PROF_SM_ACTIVE{exported_namespace="prod-inference"}[1d])
)
If average utilization is 20%, memory usage is 30 GB on an 80 GB GPU, and peak SM activity is 40%, the workload is using maybe 30% of an A100's capability. An A10G (24 GB, smaller compute) would handle it for 25-30% of the cost.
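Pulling these numbers by hand gets old; the same check can be scripted against the Prometheus HTTP API. A minimal sketch (the endpoint URL and the 30% threshold are assumptions; adjust to your setup):

```python
import json
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus:9090"  # assumption: your Prometheus endpoint


def instant_query(promql: str) -> dict:
    """Run an instant query against the Prometheus HTTP API."""
    url = PROM_URL + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())


def flag_oversized(results: dict, threshold: float = 30.0) -> list:
    """Return (pod, avg_util) pairs below the threshold, lowest first."""
    flagged = []
    for sample in results.get("data", {}).get("result", []):
        pod = sample["metric"].get("pod", "<unknown>")
        value = float(sample["value"][1])
        if value < threshold:
            flagged.append((pod, value))
    return sorted(flagged, key=lambda x: x[1])


# Example: average utilization per pod over the last day
# query = 'avg by (pod) (avg_over_time(DCGM_FI_DEV_GPU_UTIL{exported_namespace="prod-inference"}[1d]))'
# print(flag_oversized(instant_query(query)))
```

Anything this prints is a candidate for a smaller GPU, pending a benchmark.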
Common cases
The patterns I see most:
- Inference on A100 80GB when A10G or L4 would suffice. Common for small models (7B, 13B). The smaller GPU has enough VRAM and enough throughput; the A100 is overkill.
- Embedding generation on A100. Embedding workloads are mostly memory-bandwidth-bound, not compute. T4 or L4 is often plenty.
- Light inference on H100. The H100 is the new shiny; some teams pick it without checking if A100 or A10G would do.
- Image generation on H100 when A100 is fine. Diffusion models often fit on A100 80GB and run at acceptable latency. H100 is faster but the price difference is large.
What this costs
In every cluster I have audited, between 20% and 40% of GPU spend is on workloads that are over-provisioned by GPU type. A common pattern: an inference fleet running 12 A100 80GB nodes at $4.50/hr each, about $1,300/day. After right-sizing to 12 A10G nodes at $1.20/hr: about $345/day. Same throughput, roughly $28.5K/month in savings.
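The cost arithmetic is worth wrapping in a function so you can rerun it per fleet. A sketch, using the hourly rates from the example above (assumed list prices, not quotes):

```python
def rightsizing_savings(nodes: int, current_rate: float, target_rate: float,
                        hours_per_day: int = 24, days_per_month: int = 30) -> dict:
    """Monthly cost delta from moving a fleet to a cheaper GPU type
    at the same node count (assumes throughput stays equal after the move)."""
    current = nodes * current_rate * hours_per_day * days_per_month
    target = nodes * target_rate * hours_per_day * days_per_month
    return {"current": current, "target": target, "savings": current - target}


# 12 A100 80GB nodes at $4.50/hr right-sized to 12 A10G nodes at $1.20/hr
print(rightsizing_savings(12, 4.50, 1.20))
```

The assumption that throughput stays equal is the part you must verify with a benchmark before acting on the number.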
The fix is module 2 of this course. The measurement is here; the action plan is the next module.
Form 2: no autoscaling
The pattern: the cluster has 20 GPU nodes 24/7. Average utilization is 30%. Peak is 80%. The cluster is paying for 20 nodes when it really needs 8 most of the time, scaling to 16 at peak.
This is the largest single form of waste in most clusters because it is multiplicative across the fleet.
Why it happens
Three reasons:
- HPA does not work for GPUs by default. CPU-based HPA does not see GPU usage. Without custom metrics, HPA cannot scale GPU workloads.
- Cluster autoscaler is slow on GPUs. A new GPU node can take 3-10 minutes to provision and another 1-3 minutes to pull large images. Teams "pre-provision" capacity to avoid the wait.
- Manual scaling. Teams set replicas based on peak forecast, not real load. The fleet stays at peak size 24/7.
How to measure it
The single most useful metric: cluster GPU utilization over a full day.
# Average GPU utilization across the fleet, over 24 hours
avg(
  avg_over_time(DCGM_FI_DEV_GPU_UTIL{namespace="prod-inference"}[1d])
)
If this number is < 50%, you have autoscaling opportunity. < 30% is "huge opportunity." Below 20% is "are you sure these workloads need GPUs?"
A useful drill-down: utilization by hour-of-day.
avg(
  avg_over_time(DCGM_FI_DEV_GPU_UTIL{namespace="prod-inference"}[1h])
)
PromQL has no built-in hour-of-day label, so the hourly breakdown comes from graphing this series over a 24-hour window (in Grafana or via a recording rule), not from a by (hour) clause.
Most user-facing inference shows clear day/night patterns — peak business hours at 60-80%, overnight at 5-15%. The overnight idle is what scaling should reclaim.
What this costs
For a 20-node A100 cluster:
- 24/7 cost: 20 nodes × $4.50/hr × 24 hr × 30 days = $64,800/month.
- With proper autoscaling (average 12 nodes, peaking to 18): $38,880/month.
- Savings: ~$26,000/month (40% reduction).
This is typical. Autoscaling without changing anything else cuts most clusters' spend by 30-50%.
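A back-of-envelope model for this comparison (billing the autoscaled fleet at its average node count is a simplification, since real billing follows the actual scaling curve, but it is a good first estimate):

```python
def autoscaling_savings(fixed_nodes: int, avg_nodes: float, rate: float,
                        hours: int = 24 * 30) -> dict:
    """Compare a fixed-size fleet running 24/7 to an autoscaled fleet
    billed at its average node count over the month."""
    fixed = fixed_nodes * rate * hours
    scaled = avg_nodes * rate * hours
    return {"fixed": fixed, "autoscaled": scaled,
            "savings": fixed - scaled,
            "reduction_pct": round(100 * (fixed - scaled) / fixed)}


# 20 A100 nodes at $4.50/hr, autoscaling to an average of 12 nodes
print(autoscaling_savings(20, 12, 4.50))
```

Plug in your own fleet size and the average node count implied by your day/night utilization curve.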
The fix is module 3 of this course (HPA, cluster autoscaler, scale-to-zero). The measurement is here.
Form 3: non-GPU pods on GPU nodes
The pattern: a GPU node has 96 vCPU and 1.1 TB RAM. The GPU workload uses 1 GPU, 8 CPUs, and 64 GiB RAM. Other pods (logging agents, sidecars, "small batch jobs the team didn't bother to put on a separate node pool") fill in the rest. The GPU is effectively underutilized because the node is "full" of CPU work.
This sounds wrong — a node has capacity, why not use it? — but the math is brutal.
The math
A p4d.24xlarge (8 × A100 40GB) costs ~$32/hr on-demand. The CPU cost on that instance, if you had to buy CPU separately at the same provider, would be ~$3/hr (96 vCPU at AWS general-purpose pricing of about $0.03/vCPU-hr). So you are using 90% of the instance cost for GPU work and 10% for CPU work.
When a CPU-bound pod takes some of those vCPUs, the work it does is worth about $0.03/vCPU-hr, but it consumes its share of a $32/hr machine. If a CPU pod uses 8 vCPUs of a 96-vCPU GPU node, that is 8/96 × $32 = $2.67/hr for what should cost ~$0.24/hr on a CPU-only node: an 11x markup.
The CPU pod is fine; the GPU node is more expensive. The team is paying GPU prices for CPU work.
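The markup math, as a reusable sketch (the $0.03/vCPU-hr figure is the ballpark assumption used above, not a quoted price):

```python
def gpu_node_cpu_markup(pod_vcpus: int, node_vcpus: int, node_rate: float,
                        cpu_rate_per_vcpu: float = 0.03) -> dict:
    """Effective hourly cost of a CPU-only pod on a GPU node, versus
    what the same vCPUs would cost on a general-purpose CPU node."""
    on_gpu_node = pod_vcpus / node_vcpus * node_rate
    on_cpu_node = pod_vcpus * cpu_rate_per_vcpu
    return {"on_gpu_node": round(on_gpu_node, 2),
            "on_cpu_node": round(on_cpu_node, 2),
            "markup": round(on_gpu_node / on_cpu_node, 1)}


# 8 vCPUs of a 96-vCPU p4d.24xlarge at ~$32/hr
print(gpu_node_cpu_markup(8, 96, 32.0))
```

Run it against your own instance types; the markup is usually close to the GPU-to-CPU price ratio of the node.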
How to measure it
# Number of pods on GPU nodes that don't request GPUs
count(
  kube_pod_info{node=~".*gpu.*"}
  unless on(pod, namespace)
  kube_pod_container_resource_requests{resource="nvidia_com_gpu"}
)
(Adjust selectors based on your label scheme.) If this number is non-trivial — a cluster of 20 GPU nodes with 50+ non-GPU pods on them — you have this form of waste.
A simpler check: get a shell on a GPU node (SSH or kubectl debug node) and run crictl ps. Count the containers, then count which belong to GPU workloads. Everything left over is CPU work occupying a GPU node.
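For clusters too big to eyeball, the same audit can run over the output of kubectl get pods -A -o json. A sketch (the GPU node set would come from your own node labels; the sample data below is hypothetical):

```python
def non_gpu_pods_on_gpu_nodes(pods: list, gpu_nodes: set) -> list:
    """Given pod objects (the `items` of `kubectl get pods -A -o json`)
    and a set of GPU node names, return (namespace, name) of pods that
    are scheduled on GPU nodes but request no nvidia.com/gpu."""
    offenders = []
    for pod in pods:
        node = pod.get("spec", {}).get("nodeName")
        if node not in gpu_nodes:
            continue
        requests_gpu = any(
            "nvidia.com/gpu" in c.get("resources", {}).get("requests", {})
            for c in pod["spec"].get("containers", [])
        )
        if not requests_gpu:
            offenders.append((pod["metadata"]["namespace"],
                              pod["metadata"]["name"]))
    return offenders


# Typical wiring (sketch):
#   pods = json.load(sys.stdin)["items"]        # kubectl get pods -A -o json
#   gpu_nodes = {...}                           # from your node labels
#   print(non_gpu_pods_on_gpu_nodes(pods, gpu_nodes))
```

Every pod this returns is paying GPU prices for CPU work.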
Common cases
What ends up on GPU nodes that should not:
- Logging agents and metrics scrapers: Fluent Bit, datadog-agent, prometheus node-exporter, dcgm-exporter. These legitimately run on every node as DaemonSets, but their footprint is bought at GPU-node prices. Keep the per-node agents lightweight and run the heavy aggregators on dedicated CPU node pools.
- Service mesh sidecars: Istio's istio-proxy, Linkerd-proxy. Each adds 100-300 MiB of memory and a CPU share. For inference workloads where the service mesh is not adding value (e.g., pod-to-pod traffic that doesn't need mesh), disable injection.
- CI/test runners: occasionally teams use spare GPU node capacity for "fast" test runners. Cheap-feeling but expensive in reality.
- "It's where there was capacity": misconfigured scheduling that lands non-GPU pods on GPU nodes opportunistically.
The fix
Two parts:
- Taint the GPU nodes: nvidia.com/gpu=true:NoSchedule. Only pods with a matching toleration can land there; most non-GPU pods do not have one and go elsewhere.
- Tolerate intentionally: GPU workloads add the toleration; system DaemonSets that genuinely need to run on every node add it; nothing else does.
# On the GPU node group / NodePool
taints:
  - key: nvidia.com/gpu
    value: "true"
    effect: NoSchedule

# On the GPU workload
spec:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
After the taint is in place, audit which pods still tolerate it. Most should not.
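That audit is scriptable too. A sketch that flags pods whose spec tolerates the GPU taint, including catch-all Exists tolerations (pod objects as in kubectl get pods -A -o json; the sample data is hypothetical):

```python
def pods_tolerating_gpu_taint(pods: list, key: str = "nvidia.com/gpu") -> list:
    """Return (namespace, name) of pods that tolerate the GPU taint,
    either via an explicit key match or a keyless catch-all
    `operator: Exists` toleration (which tolerates every taint)."""
    tolerant = []
    for pod in pods:
        for tol in pod.get("spec", {}).get("tolerations", []):
            matches_key = tol.get("key") == key
            catch_all = tol.get("operator") == "Exists" and "key" not in tol
            if matches_key or catch_all:
                tolerant.append((pod["metadata"]["namespace"],
                                 pod["metadata"]["name"]))
                break
    return tolerant
```

Cross-check the output against your intended list: GPU workloads plus the handful of system DaemonSets, nothing else.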
What this costs
Depends on cluster shape. For a typical 10-node GPU cluster:
- Without taints: 30-50 non-GPU pods using ~10-15% of the cluster's CPU and memory. Equivalent CPU-only cost on dedicated nodes: ~$1,200/month. Cost on GPU nodes: ~$5,000-7,000/month.
- After taints: 5-10 system pods (taint-tolerated DaemonSets). Equivalent waste: $500-1000/month.
Savings: $4,000-6,000/month for a typical 10-node cluster. Smaller in absolute terms than autoscaling savings, but cheap to fix (a taint and a few tolerations).
The 80/20 of measurement
If you want to assess your cluster quickly:
- One metric for #1 (wrong GPU type): avg by (pod) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[7d])). Anything sustained below 30% is a candidate.
- One metric for #2 (no autoscaling): avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL[7d])) cluster-wide. Anything below 50% is a candidate.
- One metric for #3 (non-GPU on GPU): count of pods on GPU nodes that do not request GPUs.
Three queries. Fifteen minutes. You will have a rough sizing of each form of waste in your cluster.
Then prioritize. Usually:
- Form 2 is the biggest by spend (often $20-40K/month for a typical 10-20 node fleet).
- Form 1 is the next biggest (often $10-20K/month).
- Form 3 is the smallest in $ but easiest to fix ($2-5K/month for the cluster, fixed in an hour).
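The two ways of ranking usually disagree, and it helps to see both side by side. A sketch with illustrative figures in the ranges above (assumptions, not audit data): Form 2 wins on absolute savings, while Form 3 wins on savings per engineering hour, which is why it is worth doing immediately even though it is the smallest.

```python
def rank(fixes: dict, key) -> list:
    """Order fix names by a scoring function over (savings, hours)."""
    return sorted(fixes, key=lambda name: key(*fixes[name]), reverse=True)


# Illustrative figures (assumptions): (monthly savings USD, rough eng hours)
fixes = {
    "Form 2: autoscaling": (30_000, 80),
    "Form 1: gpu type":    (15_000, 40),
    "Form 3: taints":      (4_000, 2),
}

by_savings = rank(fixes, key=lambda s, h: s)       # biggest absolute win
by_leverage = rank(fixes, key=lambda s, h: s / h)  # best $ per eng-hour
print(by_savings[0], "|", by_leverage[0])
```

Swap in your own measured savings and honest effort estimates before committing a roadmap to it.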
A concrete walkthrough
A real cluster I audited:
- Setup: 24 nodes, mix of A100 40GB and A10G. ML inference for a B2B SaaS. Monthly compute spend: $87K.
- Form 1 (wrong GPU type): 12 A100 nodes running embedding generation. Average GPU utilization 22%. Memory usage 18 GiB out of 40. Recommendation: move to A10G (24 GB). Estimated savings: $25K/month (about 28% of total).
- Form 2 (no autoscaling): cluster running at 24 nodes 24/7. Average daily utilization 31%. Recommendation: HPA on request rate, cluster autoscaler. Average node count would drop to 14, peaking to 22. Estimated savings: $30K/month (about 35% of total).
- Form 3 (non-GPU on GPU): 23 logging/monitoring/CI pods on GPU nodes. Move to dedicated CPU nodes. Estimated savings: $4K/month (about 5%).
After all three: $59K savings/month, from $87K to $28K. Real numbers.
Note that Form 1 and Form 2 partially overlap: once the wrong-GPU-type fix moves embedding to A10G, autoscaling applies to the cheaper A10G nodes, so the savings stack with some interaction rather than adding cleanly. The actual net was about 65% savings, not the 68% you would get by simply summing the three estimates.
What is NOT GPU waste
Some patterns that look like waste but are not:
- Spare capacity for spike absorption: a cluster sized for 1.5x average load is not wasting; it is absorbing variance. Different from sized-for-peak-and-staying-there.
- Warm pools for fast scale-up: keeping a few idle GPU nodes warm so a scale-up event does not take 5 minutes. Expensive but sometimes worth it for latency-sensitive services.
- Compliance-driven dedicated capacity: some compliance frameworks require dedicated nodes for specific workloads. The "waste" is regulatory, not technical.
- Cross-AZ replicas for HA: two replicas in two AZs cost more than one in one AZ; the HA value justifies it.
The discipline: separate "we are choosing this" from "we don't know any better." The first is a deliberate trade-off; the second is the waste this lesson is about.
A team I helped had 30% average GPU utilization. They came to me wanting to "tune the model" or "switch to faster GPUs." We measured: each replica used only 25-30 GiB of memory, so an A100 40GB would still have ~10 GiB of headroom, yet they were running one replica per A100 80GB across 16 nodes. Right-sizing meant either moving to A100 40GB (about 30% cheaper per node) or, better, packing 2 replicas per A100 80GB to actually use the capacity. They ended up doing both: A10G for new workloads, MIG-partitioned A100 80GB for legacy. Cluster spend dropped 55% in a quarter. The waste was not in the model; it was in the GPU choice.
How often to remeasure
Once is not enough. The patterns of waste change as workloads evolve:
- New workloads launch on the wrong GPU type as a default (Form 1).
- Auto-scaling drifts as workload patterns change (Form 2).
- New non-GPU pods accumulate on GPU nodes if taints are not enforced (Form 3).
A quarterly audit catches all three before they grow. A monthly dashboard showing the three metrics keeps the team honest.
The course's final module covers the audit ritual in detail. For now, install the metrics and look at them once a quarter.
Summary
Three forms of GPU waste, each with a specific measurement:
- Form 1 — wrong GPU type: workload uses 20-30% of a GPU's capability; a smaller GPU would do. Measured by per-workload GPU utilization. Usually 20-40% of total spend.
- Form 2 — no autoscaling: cluster runs 24/7 at peak size; average utilization 30%. Measured by cluster-wide GPU utilization. Usually 30-50% of total spend.
- Form 3 — non-GPU pods on GPU nodes: CPU-bound pods consuming GPU node capacity. Measured by counting non-GPU pods on GPU nodes. Usually 5-10% of total spend; cheapest to fix.
The right diagnostic: three Prometheus queries, fifteen minutes. Estimate the savings from each form for your cluster, then prioritize by dollars saved per hour of engineering: usually that means doing taints immediately (small savings, but an hour of work), then autoscaling (biggest by spend), then GPU type (next biggest, but it needs benchmarking).
The next lesson is the prioritized list — the 80/20 of GPU optimization across all three forms — so you know which fixes to apply first.