GPU Cost Optimization on Kubernetes

The True Cost of a GPU

Most teams know what an A100 costs per hour. They will quote $3.06/hr on demand for a p4d.24xlarge on AWS, or $3.67/hr for an A100 on GCP. They will then divide that by their request rate, get something like $0.002/request, and put it on a dashboard. The number is wrong by anywhere from 30% to 200%, and the gap is where the optimization lives.

This lesson is about the actual cost of running a GPU in production. The hourly compute rate is one of five line items, and any of the other four can be larger than people realize. Get this number right and the rest of the course shows you how to cut it; get it wrong and you optimize the wrong things.

KEY CONCEPT

The "GPU costs $3/hr" number is the list price for one component of one configuration. The actual cost per running GPU in production is the hourly rate plus storage plus network plus idle overhead plus driver/operator overhead plus the share of the broader cluster the GPU node consumes. For most production setups, the true cost is 1.5x to 2.5x the list compute rate.

The five line items

A GPU node's true monthly cost is the sum of:

What a GPU actually costs in production:

  1. Compute (the hourly rate): the number on the cloud's pricing page; baseline 100 percent.
  2. Local storage and EBS/PD root disks: large image storage for ML workloads adds 5-15 percent.
  3. Network egress (the silent cost): model weights, intermediate data, cross-AZ traffic adds 10-30 percent.
  4. Idle and overhead time: startup, image pulls, warm-up, between-batch idle adds 15-40 percent.
  5. Cluster overhead (kube-system, monitoring, ingress): DaemonSets and shared infra add 5-10 percent of node cost.

A GPU node listed at $3/hr ends up costing $4.50-$7.50/hr in actual usage once everything is counted. This is not waste; some of it is unavoidable. But knowing the breakdown is the first step to attacking the parts that are.
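The five line items can be folded into a single multiplier on the list rate. A minimal sketch in Python, using the midpoints of the ranges above as assumed defaults (not measured values from any specific cluster):

```python
# True hourly cost of a GPU node: list rate plus the four overhead line
# items, each expressed as a fraction of compute. Defaults are midpoints
# of the ranges quoted above (assumptions, not measurements).
def true_hourly_cost(list_rate,
                     storage_pct=0.10,     # storage: 5-15% of compute
                     egress_pct=0.20,      # network egress: 10-30%
                     idle_pct=0.275,       # idle and overhead: 15-40%
                     overhead_pct=0.075):  # cluster overhead: 5-10%
    multiplier = 1 + storage_pct + egress_pct + idle_pct + overhead_pct
    return list_rate * multiplier

print(round(true_hourly_cost(3.0), 2))  # a $3/hr list-price GPU -> 4.95
```

With midpoint overheads, a $3/hr GPU lands at about $4.95/hr, squarely inside the 1.5x-2.5x range.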

Line item 1: compute

The headline number. For NVIDIA GPUs on the major clouds, current on-demand pricing roughly tracks:

GPU compute pricing snapshot (2026, on-demand):

GPU           VRAM      USD per hour     USD per month (24x7)
T4            16 GB     0.35 to 0.55     about 250 to 400
L4            24 GB     0.65 to 0.95     about 470 to 690
A10G (g5)     24 GB     1.00 to 1.30     about 720 to 940
A100 40 GB    40 GB     2.50 to 3.20     about 1,800 to 2,300
A100 80 GB    80 GB     3.50 to 4.50     about 2,500 to 3,250
H100 80 GB    80 GB     8.00 to 12.00    about 5,800 to 8,700
H200 141 GB   141 GB    11.00 to 14.00   about 8,000 to 10,000

Prices vary by region and instance family; check the cloud's calculator for exact numbers.

Two things people miss when they read this table:

Per-GPU vs per-instance

Cloud SKUs bundle GPUs. An AWS p4d.24xlarge is $32.77/hr on-demand; that gets you 8 A100 40GB GPUs. The "per-GPU" cost is $32.77 / 8 = $4.10/hr. People sometimes quote the instance price ("we're paying $32/hr for our GPU box") and conclude it is too expensive when really they have 8 GPUs in there.

The per-instance price matters for capacity planning (you cannot rent half an instance), but the per-GPU price is what compares fairly across instance types.
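The division is trivial, but it is worth writing down because it is the comparison people skip (figures are the p4d.24xlarge numbers from the text):

```python
# Per-GPU hourly price from a bundled SKU: instance price / GPU count.
instance_hourly = 32.77    # p4d.24xlarge on-demand, per the text
gpus_per_instance = 8      # 8x A100 40GB
per_gpu_hourly = instance_hourly / gpus_per_instance
print(round(per_gpu_hourly, 2))  # 4.1
```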

List price vs effective price

The list price assumes:

  • On-demand (no spot, no reserved).
  • 24x7 utilization.
  • Linux (Windows costs more).

A team using spot, reserved instances, or savings plans pays anywhere from 30% to 70% less than list. A team running pods at 30% utilization pays the same list price, but the effective price per utilized GPU-second is more than 3x higher. Both effects matter; both belong in your TCO calc.

The relevant number for cost optimization is "effective USD per useful work done." The list price is the starting point, not the answer.
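Both effects fit in one "effective rate" helper; a sketch with illustrative numbers, not any cloud's actual pricing:

```python
# Effective cost per utilized GPU-hour: purchase discounts lower the rate,
# low utilization raises the cost of the work you actually get done.
def effective_rate(list_rate, discount=0.0, utilization=1.0):
    return list_rate * (1 - discount) / utilization

# A 40% savings-plan discount looks great on paper...
print(round(effective_rate(3.0, discount=0.40), 2))        # 1.8
# ...but at 30% utilization the effective rate is double list price.
print(round(effective_rate(3.0, 0.40, utilization=0.30), 2))  # 6.0
```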

Line item 2: storage

Three storage costs hit every GPU node:

Root disk (boot volume)

Default 30-100 GiB, often gp3 or gp2 (older). Costs a few dollars a month. Mostly negligible.

Image storage

ML images are big. A typical PyTorch + CUDA image is 5-10 GiB; vLLM with model code can be 15+ GiB. With node churn, image cache turnover, and N nodes, the on-node image storage adds up.

For a 10-node A100 cluster running PyTorch images:

  • 10 nodes * 10 GiB image storage = 100 GiB
  • ECR or registry storage: a few cents per GB-month, plus pull egress
  • Disk cost: typically ~$10-50/month

Small in absolute terms, but image-pull time is bigger — covered in line item 4.

Model weights and dataset storage

This is the big one. ML workloads cache model weights locally for fast loading:

  • Llama 3 70B FP16 weights: 140 GiB
  • Llama 3 405B FP16: 810 GiB
  • A typical fine-tuned variant: 30-200 GiB

If every pod has its own copy on the local disk (default behavior of many naive deployments), and you have N replicas across M nodes, you pay for N * model_size in disk. With model weights at 140 GiB and 8 replicas: 1.1 TiB.
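The naive-pattern disk bill is a straight multiplication. A sketch, assuming a local-NVMe rate of $0.30/GiB-month (an assumption, not a quote):

```python
# Disk cost when every replica keeps its own copy of the model weights
# (the naive pattern). The $/GiB-month rate is an assumption.
def weight_storage_monthly(model_gib, replicas, usd_per_gib_month=0.30):
    return model_gib * replicas * usd_per_gib_month

# 8 replicas of a 140 GiB model: 1,120 GiB on disk.
print(round(weight_storage_monthly(140, 8), 2))  # 336.0
```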

Better patterns (lesson 5.1 of Production LLM Inference covers this in depth): a shared volume (EFS, FSx for Lustre, hostPath cache, or persistent EBS), or model loading from S3/GCS at startup with appropriate caching.

Storage cost for a typical inference cluster runs 5-15% of compute when measured properly. A team I worked with had local NVMe scratch for every replica's model copy at $0.30/GiB-month; the storage line item was $1,200/month for a $25,000/month compute spend (5%).

Line item 3: network egress

The most frequently overlooked cost. Two main sources:

Model loading

A 70B-parameter model in FP16 is 140 GiB. Pulled out to the internet or across regions at roughly $0.09/GB, that is 0.09 * 140 = $12.60 per pull. Loads from S3 to EC2 within the same region are free; cross-region and cross-cloud pulls are not.

If you scale from 0 to N pods, you pay N * (140 GiB worth of egress) — though if S3 is in the same region, it is free. The catch: many teams use a multi-region setup where the model lives in us-east-1 S3 but pods run in eu-west-1. Cross-region egress of 140 GiB per pod start is real money.

Inference traffic

Each inference request has a request payload (small) and a response (tokens, also small for chat; large for image generation). The data transfer:

  • Pod to client (egress out): 1-5 cents per GB on AWS.
  • Pod to pod cross-AZ: 1-2 cents per GB. Adds up for chatty patterns.
  • Pod to S3 (intra-region): free.
  • Pod to managed cache (Redis, MemoryDB): free if same VPC.

For a busy inference service handling 1000 RPS with a 2 KB average response, that's 2 MB/s, or about 5.2 TB/month of egress. At 1 cent/GB cross-AZ, that's roughly $52/month; at 5 cents/GB egress to the internet (if served via an internet-routed NLB), about $260/month. Larger responses, request payloads, and chatty inter-service traffic scale those figures linearly.
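The same arithmetic as a reusable helper, in decimal KB/GB to match how clouds bill transfer (rates are illustrative):

```python
# Monthly egress bill for an inference service: request rate x average
# response size x per-GB transfer price.
def monthly_egress_usd(rps, resp_kb, usd_per_gb):
    bytes_per_month = rps * resp_kb * 1_000 * 86_400 * 30
    return bytes_per_month / 1e9 * usd_per_gb

print(round(monthly_egress_usd(1000, 2, 0.01), 2))  # cross-AZ: 51.84
print(round(monthly_egress_usd(1000, 2, 0.05), 2))  # internet: 259.2
```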

For inference at scale, network egress can be 10-30% of compute cost.

Distributed training

Multi-node training traffic crosses NVLink/NVSwitch within a node (free), then the inter-node fabric (free on EFA or InfiniBand inside a placement group), then ordinary VPC traffic (chargeable cross-AZ). A training job doing all-reduce across 8 nodes can move multiple petabytes per epoch.

For a team training Llama-class models, network egress can spike to 30-50% of compute. The fix is keeping training within a single placement group or single-AZ cluster — covered in module 4 of Production GPU Infrastructure.

Line item 4: idle and overhead time

The largest line item for most under-utilized clusters. Three components:

Pod startup overhead

A typical inference pod takes 60-300 seconds from "scheduled" to "Ready":

  • 30-60s for image pull (cold image)
  • 30-180s for model loading (depends on size and where weights come from)
  • 5-30s for warmup, JIT compilation, kernel autotuning

During this entire window the GPU is reserved (the pod holds an nvidia.com/gpu: 1 resource request) but doing no useful work. You pay for the GPU during startup.

For a service that scales pods often, this is a major recurring cost. Each scale-up event "wastes" 60-300s of GPU time per pod added.
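The recurring startup tax can be estimated per scale-up event. A sketch with assumed numbers (20 scale-ups a day, 4 pods each, 180 s to Ready, $4/hr per GPU):

```python
# GPU-dollars burned per scale-up event:
# pods added x startup seconds x hourly GPU rate.
def scaleup_cost_usd(pods_added, startup_s, gpu_usd_per_hr):
    return pods_added * startup_s / 3600 * gpu_usd_per_hr

per_event = scaleup_cost_usd(4, 180, 4.0)
monthly = 20 * per_event * 30  # 20 scale-ups/day for a month
print(round(per_event, 2), round(monthly, 2))  # 0.8 480.0
```

Eighty cents per event sounds harmless; at 20 events a day it is $480/month of GPU time spent pulling images and loading weights.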

Between-batch idle (continuous batching helps; not all workloads use it)

For a single-batch inference server (not continuous batching), the GPU is idle between batches. If batches arrive every 100ms but each batch takes 50ms to process, the GPU is at 50% utilization. The other 50% is paid-for idle.

Continuous batching (vLLM, TGI) fixes most of this — the GPU is always processing tokens — but only if you are using it. Many teams running Triton or custom servers without dynamic batching see much lower utilization.

Capacity-driven over-provisioning

This is the bigger one. Most teams provision GPU capacity for peak load, not average. Average might be 30% of peak; the rest of the time, GPUs are idle.

Combined with HPA misconfigured for GPUs (Module 3.1 covers this), it is common to see clusters where GPUs run at 20-40% utilization on average. That means 60-80% of GPU spend is on idle time.

This is the largest single optimization lever in most clusters. A team I helped reduced their GPU bill from $180K/month to $67K/month over a quarter — most of the savings came from increasing average utilization from 25% to 65%, not from cheaper GPUs.
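A two-line calculation makes the utilization lever concrete, using the case study's starting point and the simplifying assumption that useful spend scales directly with utilization:

```python
# Portion of a GPU bill that pays for idle capacity at a given average
# utilization (assumes useful spend is proportional to utilization).
def idle_spend(monthly_bill_usd, avg_utilization):
    return monthly_bill_usd * (1 - avg_utilization)

# $180K/month at 25% average utilization:
print(idle_spend(180_000, 0.25))  # 135000.0: 3/4 of the bill bought idle time
```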

Line item 5: cluster overhead

Every node in the cluster pays its share of cluster-wide infrastructure:

  • kube-system DaemonSets: kube-proxy, CNI agent (Calico, Cilium), CSI node plugin, NodeProblemDetector. Typically 100-300 MiB of memory and a small slice of CPU per node. Not GPU-bound, but carving them out from GPU workloads requires kubelet resource reservations (system-reserved and kube-reserved).
  • Monitoring agents: Datadog, Prometheus node-exporter, dcgm-exporter, log shippers. Another 200-500 MiB of memory per node.
  • GPU operator components: nvidia-device-plugin, dcgm-exporter, gpu-feature-discovery, fabric-manager. Adds 200-500 MiB of memory; a small CPU footprint.
  • Service mesh sidecars (if applicable): Istio, Linkerd. Memory overhead per pod, plus a separate sidecar pod per workload pod.

These typically eat 5-10% of node capacity. On a GPU node where the hourly compute is $4/hr, that's $0.20-$0.40/hr of "cluster tax." Not large, but not zero.

The fix at scale: dedicated system node pools on cheaper instances (covered in Production K8s Operations Module 2.2). Per-node DaemonSets still have to run on the GPU nodes, but the bulk of cluster-wide controllers and ingresses belongs on cheap CPU nodes.

Bringing it together: the "true cost" formula

For a representative on-demand A100 80GB node running an inference workload:

Line item                                      Hourly cost   Monthly (24x7)
Compute (list price)                           $4.00         $2,880
Storage (root + image cache + model weights)   $0.20         $144
Network egress (cross-AZ + outbound)           $0.40         $288
Idle overhead (startup + low utilization)      $1.20         $864
Cluster overhead (DaemonSets + monitoring)     $0.20         $144
True cost                                      $6.00         $4,320

The headline $4/hr is 67% of the real number. Any cost-per-request or cost-per-token figure computed from it understates the true cost by 50%.

The real cost is what you compare optimization options against. If switching from on-demand to 1-year reserved cuts compute by 40% ($4 -> $2.40/hr), the savings are $1.60/hr against a $6.00 true cost: a 27% reduction, not 40%. Still real money, but a smaller fraction than you might think.
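The worked example reduces to a small dict; summing it reproduces the bottom line of the table:

```python
# The $4/hr A100 example above: hourly line items and the monthly (24x7)
# total at 720 hours per month.
line_items_hourly = {
    "compute (list price)": 4.00,
    "storage": 0.20,
    "network egress": 0.40,
    "idle overhead": 1.20,
    "cluster overhead": 0.20,
}
true_hourly = sum(line_items_hourly.values())
print(round(true_hourly, 2), round(true_hourly * 24 * 30, 2))  # 6.0 4320.0
```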

Why your "cost per request" is probably wrong

Most teams compute cost per request as:

cost_per_request = (compute_per_hour) / (requests_per_hour)

This is wrong in three ways:

Ignores idle

If the cluster is provisioned for peak (1000 RPS) but averages 300 RPS, teams tend to put the peak capacity in the denominator. The cluster cost reflects 1000 RPS of hardware while only 300 RPS of requests actually arrive, so the formula understates the real cost per request by more than 3x.

The fix:

true_cost_per_request = (true_compute_per_hour) / (avg_requests_per_hour)

Use average requests over the time the cluster runs, not peak. And use true compute (all five line items).
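The corrected formula is a one-liner; the inputs are the hard part. A sketch using the $6/hr true-cost figure from the example above:

```python
# True cost per request: all five line items in the numerator,
# average (not peak) request rate in the denominator.
def true_cost_per_request(true_usd_per_hr, avg_rps):
    return true_usd_per_hr / (avg_rps * 3600)

# $6/hr true cost at 300 RPS average:
print(f"${true_cost_per_request(6.0, 300):.2e} per request")  # $5.56e-06
```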

Ignores per-token billing in inference

For LLM inference, "cost per request" is misleading because requests vary wildly in size. A single chat with 100 input tokens and 50 output tokens uses much less GPU time than a 10K-token RAG query with 2K-token output.

Better: cost per token, separated for input (cheap, can be batched aggressively) and output (expensive, sequential generation).

cost_per_input_token = (cost_per_request_input_phase) / input_tokens
cost_per_output_token = (cost_per_request_output_phase) / output_tokens

For most LLMs, output tokens cost 5-10x more than input tokens because of how prefill vs decode phases utilize the GPU.
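A hedged sketch of the split, assuming you can measure or estimate per-request prefill and decode times (the timings below are illustrative, not benchmarks):

```python
# Per-token costs from phase timings: prefill processes all input tokens
# in parallel; decode generates output tokens sequentially, so it
# dominates GPU-seconds per token.
def per_token_costs(gpu_usd_per_hr, prefill_s, decode_s,
                    input_tokens, output_tokens):
    usd_per_s = gpu_usd_per_hr / 3600
    return (usd_per_s * prefill_s / input_tokens,
            usd_per_s * decode_s / output_tokens)

# Assumed: 0.1 s prefill for 1000 input tokens, 1.0 s decode for 1000
# output tokens, on a $6/hr (true cost) GPU.
cin, cout = per_token_costs(6.0, 0.1, 1.0, 1000, 1000)
print(round(cout / cin, 1))  # 10.0: output tokens 10x input here
```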

Ignores variable batching

Continuous batching means a single GPU might serve 10-50 concurrent requests at once. The "cost per request" depends on batch size — bigger batches mean lower per-request cost.

For accurate costing, use cost per token (not per request) and report it for the actual batch sizes you serve in production.
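Batch-size sensitivity in one helper. The simplifying assumption here (batch latency independent of batch size) is optimistic, but it shows the shape of the effect:

```python
# Per-request cost as a function of batch size, assuming (optimistically)
# that batch latency does not grow as the batch gets bigger.
def cost_per_request(gpu_usd_per_hr, batch_size, batch_latency_s):
    return gpu_usd_per_hr / 3600 * batch_latency_s / batch_size

solo = cost_per_request(6.0, 1, 0.5)
batched = cost_per_request(6.0, 32, 0.5)
print(round(solo / batched, 1))  # 32.0: the batch amortizes the GPU
```

In practice latency does grow with batch size, so the real gain is smaller; this is why per-token costing at production batch sizes beats per-request costing.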

WAR STORY

A team I helped showed me their cost dashboard: "$0.001 per request." Looked great. We dug in and found: the request count in the denominator came from a Prometheus query that excluded retries and health checks (about 60% of actual API calls). The compute cost was list price for on-demand only (they were 60% spot, and the reserved share they had purchased was not amortized correctly). And they were measuring "average" RPS during business hours only, not 24x7. Real cost per request was about 5x what the dashboard showed. The dashboard was not lying; it was showing a computable but meaningless metric. Cost dashboards must be skeptically reviewed; the formulas matter.

Cost per useful work, not cost per cluster hour

The most useful cost metric is cost per useful work done, where "useful work" is whatever your business measures:

  • For LLM inference: cost per million tokens served (input and output separately).
  • For batch ML inference: cost per million predictions.
  • For training: cost per epoch or cost per benchmark improvement.
  • For research: cost per experiment completed.

Tracking this over time tells you whether optimization is working. "We cut our GPU bill by 30%" is good; "we cut cost per million tokens by 40% while traffic doubled" is better — that is real efficiency improvement, not just a smaller cluster.

The course's "$180K → $67K" case study is real, and the team did not just shrink the cluster. They cut cost per million tokens by 65% while traffic grew 30% over the same quarter. That is the actual win.

Summary

The "GPU costs $X/hr" number is one component of the real cost. The five line items:

  • Compute: the list price; 100% baseline.
  • Storage: 5-15% (mostly from large model weights).
  • Network egress: 10-30% (cross-AZ, outbound, cross-region).
  • Idle and overhead: 15-40% (the largest unnecessary cost in most clusters).
  • Cluster overhead: 5-10% (DaemonSets, monitoring, GPU operator).

True cost is typically 1.5-2.5x the list compute rate. Cost per request based only on list compute is wrong by 50% or more.

The right metric for tracking optimization: cost per useful work done — cost per million tokens, cost per epoch, cost per inference. Track it over time as the leading indicator of whether the rest of this course is actually saving you money.

The next lesson is the catalog of waste: the three forms of GPU waste that show up in essentially every production cluster, and how to measure each one.