The 80/20 of GPU Optimization
The previous lesson catalogued the three forms of waste. This one is the action plan: of all the things you could do to optimize GPU spend, which ones deliver the most for the least effort, and what order to apply them in.
This is a prescriptive lesson. The full course covers each of these techniques in depth; this lesson is the priority queue. If you have a quarter to cut your GPU bill in half, here is what you do first, second, third — and what you might consciously choose not to do because the ROI does not justify the engineering cost.
The high-leverage GPU optimizations are: GPU-aware autoscaling, right-sizing GPU types, and spot/reserved capacity for the steady-state. These three changes usually capture 70-80 percent of the achievable savings. The remaining 20-30 percent comes from a long tail of smaller improvements — quantization, MIG, ingress optimization — each worth a few percentage points individually. Spend the first quarter on the top three; come back to the long tail in quarter two.
The ranked priority list
In rough order of impact for typical GPU clusters:
The list is ordered roughly by dollars saved per engineering-hour: items at the top deliver the biggest savings per week of work; items at the bottom deliver smaller savings that are sometimes still worth capturing, but with diminishing returns.
#1 — GPU-aware autoscaling
The biggest lever in most clusters because it attacks idle time directly.
Why it is #1
Almost every team I have audited has GPU clusters running at 30-50% average utilization. The other 50-70% is idle time that you pay for. Autoscaling reclaims most of it.
The math:
- Cluster sized for peak: $90K/month, 30% average utilization.
- Cluster autoscaled to match demand: average node count drops to 50% of peak. $45K/month average.
- Plus warmup capacity and overhead: the real number is more like $55-60K. Still a 35-40% reduction.
For a typical mid-sized cluster, this is $20-40K/month in savings. Bigger clusters scale proportionally.
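The arithmetic above can be sketched as a tiny cost model. All figures are the illustrative numbers from the bullets, not measurements:

```python
# Back-of-envelope model for autoscaling savings.
# Inputs are illustrative assumptions, not measured data.

def autoscaled_cost(peak_cost, avg_node_fraction, overhead_fraction=0.15):
    """Monthly cost after autoscaling: average node count as a fraction
    of peak, plus warmup/overhead capacity kept online."""
    base = peak_cost * avg_node_fraction
    return base * (1 + overhead_fraction)

peak = 90_000  # cluster sized for peak, $/month
cost = autoscaled_cost(peak, avg_node_fraction=0.50, overhead_fraction=0.25)
savings = 1 - cost / peak
print(f"${cost:,.0f}/month, {savings:.0%} reduction")  # $56,250/month, 38% reduction
```

Plug in your own average-node fraction and overhead to get a defensible savings estimate before committing a quarter of work.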
What to do
The full Module 3 covers this. The quick version:
- Install custom metrics for HPA: GPU utilization is necessary; request rate is better; queue depth is best for dynamic batching. CPU-based HPA does not work for GPU workloads.
- HPA with reasonable scale-up speed: GPU pods take 60-300 seconds to become Ready. The autoscaler needs to compensate by scaling up before the SLO breaks, not after.
- Cluster autoscaler or Karpenter for GPU node pools: when HPA wants more replicas than the cluster has GPU nodes for, the cluster autoscaler provisions more. Karpenter is much faster than Cluster Autoscaler for GPU instances.
- Tune scale-down: GPU pods are expensive to start up; do not scale down too aggressively. A 10-15 minute "stabilization window" prevents flapping that costs more than it saves.
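The scale-up mechanics follow the standard HPA formula, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). A sketch of it applied to a hypothetical queue-depth metric (the metric values and bounds are made up):

```python
import math

def desired_replicas(current_replicas, current_queue_depth,
                     target_queue_depth, min_replicas=2, max_replicas=20):
    """Standard HPA algorithm: scale replicas proportionally to how far
    the observed metric is from its target, clamped to [min, max]."""
    desired = math.ceil(current_replicas * current_queue_depth / target_queue_depth)
    return max(min_replicas, min(desired, max_replicas))

# 4 replicas, observed queue depth 30 against a target of 10:
print(desired_replicas(4, 30, 10))  # 12
```

The `min_replicas` floor doubles as cold-start protection: the service never scales to zero, so there is always a warm pod to absorb the first burst.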
What to watch out for
- Cold start latency: scaling from 0 means waiting for image pulls and model loading. For latency-sensitive services, keep a minimum replica floor.
- Spot capacity availability: if you autoscale onto spot and spot is unavailable, you wait. Have on-demand fallback in your NodePool.
- Quota limits: GPU instance quotas in cloud accounts are often low. Make sure your scale-up ceiling does not exceed your quota.
Time investment
A quarter of focused engineering work. HPA tuning is a few days; cluster autoscaler configuration is a few days; testing and rollout takes the rest.
#2 — Right-size GPU types
The second-biggest lever. Often overlapping with autoscaling but distinct.
Why it is #2
Many workloads are running on more powerful GPUs than they need. The pattern from lesson 1.2:
- An embedding workload running on A100 80GB at 22% utilization. Same throughput on A10G.
- An inference workload running 7B model on A100 40GB. Fits comfortably on L4 with good throughput.
- A small fine-tuning job running on H100. Could finish on A100 in 1.5x the time at 50% the cost.
An A10G is roughly 4x cheaper per hour than an A100. L4 is even cheaper for the right workloads. Right-sizing is a one-time decision per workload class with permanent ongoing savings.
What to do
For each workload class:
- Measure on the current GPU: average utilization, peak memory usage, peak throughput.
- Benchmark on smaller GPUs: run the same workload on A10G or L4. Measure throughput and latency.
- Decide based on cost-per-throughput: not "is the smaller GPU faster" but "is the smaller GPU cheaper per useful work."
- Roll out gradually: canary on the smaller GPU; compare metrics; promote.
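The cost-per-throughput decision can be made concrete with a small comparison. The hourly prices and token rates below are placeholders; substitute your own benchmark numbers:

```python
# Compare GPUs on cost per unit of useful work, not raw speed.
# Prices and throughputs are illustrative placeholders.

def cost_per_million_tokens(hourly_price, tokens_per_second):
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price / tokens_per_hour * 1_000_000

a100 = cost_per_million_tokens(hourly_price=4.10, tokens_per_second=2400)
a10g = cost_per_million_tokens(hourly_price=1.00, tokens_per_second=900)

# The A10G is ~2.7x slower here, yet still cheaper per token.
print(f"A100: ${a100:.2f}/M tokens, A10G: ${a10g:.2f}/M tokens")
```

With these placeholder numbers the smaller GPU wins despite being much slower, which is exactly the "cheaper per useful work" framing from the bullet above.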
The full Module 2 walks through the decision tree. Key heuristics:
- Memory is the constraint for many workloads. If model weights + KV cache fit in 24 GB, A10G works. If they need 40 GB, A100 40GB. If they need 80 GB, A100 80GB or H100.
- Compute throughput matters less than people think for inference. Many inference workloads are memory-bandwidth-bound, not compute-bound. The smaller GPU's lower TFLOPS is often acceptable.
- Precision trade-offs change the equation. INT8 / INT4 quantization (covered in #4) often shifts a workload onto a smaller GPU.
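The memory heuristic can be sketched as a rough fit check. The KV-cache formula below assumes a standard decoder-style transformer, and all sizes are illustrative:

```python
# Rough memory-fit check: model weights + KV cache + overhead vs GPU memory.
# Assumes a standard decoder where KV cache per token is
# 2 (K and V) * layers * hidden_dim * bytes per element.

def fits(params_b, bytes_per_param, n_layers, hidden_dim,
         max_tokens_in_flight, gpu_gib, kv_bytes=2, overhead_gib=2.0):
    weights_gib = params_b * 1e9 * bytes_per_param / 2**30
    kv_gib = 2 * n_layers * hidden_dim * kv_bytes * max_tokens_in_flight / 2**30
    return weights_gib + kv_gib + overhead_gib <= gpu_gib

# A 7B model in FP16 with ~16K tokens in flight on a 24 GiB A10G:
print(fits(params_b=7, bytes_per_param=2, n_layers=32, hidden_dim=4096,
           max_tokens_in_flight=16_384, gpu_gib=24))  # True — just barely
```

The "just barely" is the point: a 7B FP16 model fits a 24 GiB card, but only with a bounded number of tokens in flight, which is why batch size appears again in the caveats below.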
What to watch out for
- Don't right-size in production without benchmarking. The smaller GPU might be slower than expected for your specific workload.
- Concurrent batch size matters. Smaller GPUs have less memory, which caps the max batch size. If your workload depends on large batches for throughput, the smaller GPU may have lower max throughput, not just higher per-request latency.
- Driver compatibility. Newer GPUs (H100, L4) require newer CUDA drivers; older deployments may need to upgrade.
Time investment
Two weeks per workload class for benchmarking and rollout. With 5-10 workload classes in a typical cluster, this is a quarter of work spread across teams.
#3 — Spot for training, savings plans for inference
The capacity-side lever. After you have right-sized and autoscaled, the remaining compute is "what we will be running 24/7." That capacity should not be on-demand.
Why it is #3
Spot, savings plans, and reserved instances offer 30-70% off list compute prices. For the "always-on" portion of your cluster, this is free money — same workload, much cheaper compute.
The split:
- Training is mostly batch and tolerant of interruption (with checkpointing). Spot is the right answer, saving 60-70% off on-demand.
- Inference baseline is the steady-state load that runs 24/7. Reserved Instances or Savings Plans cover it, saving 30-40% off on-demand.
- Inference burst capacity is the peak above baseline. Stay on-demand; you do not need a commitment that may go unused.
What to do
Module 4 covers this in detail. The shape:
- For training: switch to spot. Implement checkpointing; choose instance families with reasonable spot interruption rates; have on-demand fallback.
- For inference baseline: figure out your steady-state node count over a typical month. Buy Savings Plans or RIs for that capacity. Commit at most 60-70% of average usage to leave headroom.
- For burst: stay on-demand. Karpenter handles the dynamic provisioning.
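The 60-70 percent commitment rule can be sketched as follows; the month of usage data is made up:

```python
# Size a Savings Plan / RI commitment from a month of hourly node counts,
# committing only a fraction of average usage to leave headroom.
# The usage series below is hypothetical.

def commitment(hourly_node_counts, commit_fraction=0.65):
    """Nodes to cover with commitments; the rest stays on-demand/spot."""
    avg = sum(hourly_node_counts) / len(hourly_node_counts)
    return int(avg * commit_fraction)

# A month (720 hours) swinging between 8 and 20 GPU nodes:
usage = [8] * 300 + [14] * 300 + [20] * 120
print(commitment(usage))  # 8 nodes committed; bursts above stay on-demand
```

Committing below average rather than at peak is deliberate: an unused commitment is pure waste, while burst capacity at on-demand prices is only a marginal premium.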
What to watch out for
- Spot interruption rate varies by region and instance type. Check the AWS interruption rate dashboard (or cloud equivalent). Some GPU instances have 20 percent or higher interruption rates; some have under 5 percent.
- Savings Plans are 1-year or 3-year. Picking 3-year locks you in; growth or workload changes might leave commitments unused.
- GPU Reserved Instances are not always available for the GPU type you want. Some clouds discount specific instance types; check before committing.
Time investment
A few weeks for spot adoption (checkpointing infrastructure, fleet diversification). A few hours of finance work for Savings Plan / RI purchases.
#4 — Quantization
A model-side lever that enables smaller GPUs. Does not directly save GPU cost; enables right-sizing.
Why it is #4 (not higher)
Quantization is workload-specific. Some models quantize to INT8 with no quality loss; others lose accuracy at INT8 and would need engineering work to retain quality. The ROI depends entirely on the model.
For LLMs, INT8 / INT4 quantization is increasingly mature and often free quality-wise. For smaller specialized models, quantization is sometimes "free" and sometimes requires retraining or quantization-aware training (QAT).
What to do
- Measure baseline quality on the FP16 version.
- Try post-training quantization first: bitsandbytes, GPTQ, AWQ for LLMs. Run the same eval suite. Compare quality.
- If quality is acceptable, deploy. Monitor production quality with sampled eval.
- If quality drops, consider QAT or a different quantization method. Or accept the quality drop if the cost savings justify it.
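The evaluate-then-decide loop reduces to a simple quality gate. The scores and the drop budget below are placeholders for your own eval suite:

```python
# Quality gate for post-training quantization: deploy the quantized model
# only if the eval-score drop stays within budget. Scores are placeholders.

def quantization_gate(fp16_score, int8_score, max_drop=0.01):
    """Accept the quantized model if the absolute quality drop <= max_drop."""
    return (fp16_score - int8_score) <= max_drop

print(quantization_gate(0.872, 0.868))  # True  — drop of 0.004, ship INT8
print(quantization_gate(0.872, 0.840))  # False — drop of 0.032, try QAT
```

The same gate, run continuously against sampled production traffic, is what catches the silent edge-case regressions mentioned below.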
Module 2.3 covers the decision in depth.
What to watch out for
- Quality regression. INT8 sometimes silently degrades quality on edge cases. Have an eval suite that catches it.
- Latency vs throughput trade-off. Quantization sometimes increases throughput (more concurrent requests) but slightly increases per-request latency due to dequantization overhead. Measure both.
- Hardware-specific support. INT4 needs Tensor Core support; older GPUs (T4) only have INT8 Tensor Cores; newer GPUs (H100) have FP8.
Time investment
A week per model class for evaluation and rollout. Faster for popular models with established quantization recipes (Llama, Mistral); slower for specialized models.
#5 — MIG partitioning
A specific technique for sharing one big GPU across multiple workloads. Powerful when applicable; rarely the biggest lever.
Why it is #5 (lower)
MIG (Multi-Instance GPU) lets a single A100 80GB be partitioned into 7 smaller GPUs, each with its own dedicated memory and compute. It is great when:
- You have multiple workloads each needing modest GPU resources.
- A100 / H100 is the available GPU but workloads do not need a full one.
- You want isolation between tenants on shared hardware.
It is not great when:
- Your workloads each need a full GPU (training, large inference).
- You can use smaller GPUs directly (A10G, L4) for similar cost.
- The workload pattern changes often (changing the MIG layout requires draining the workloads on that GPU first).
What to do
Production GPU Infrastructure Module 3 covers MIG in depth. The heuristic:
- Use MIG when: A100 / H100 is what you have access to, multiple small workloads share the GPU, throughput per dollar is acceptable.
- Skip MIG when: smaller GPUs are available and cheaper per unit of throughput.
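One way to apply the heuristic is to compare the effective cost of a MIG slice against a standalone small GPU. The prices below are illustrative:

```python
# Effective hourly cost of one MIG slice vs a standalone small GPU.
# Prices are illustrative placeholders.

def cost_per_mig_slice(a100_hourly, n_slices=7):
    # An A100 80GB supports up to 7 x 1g.10gb MIG instances
    return a100_hourly / n_slices

slice_cost = cost_per_mig_slice(a100_hourly=4.10)  # ~$0.59/hr per slice
a10g_hourly = 1.00                                 # standalone small GPU

print(f"MIG slice: ${slice_cost:.2f}/hr vs A10G: ${a10g_hourly:.2f}/hr")
# The slice only wins if you can actually keep all 7 slices busy.
```

The comment is the catch: the per-slice price assumes full packing. At 3 busy slices out of 7, the effective cost per busy slice roughly doubles and the standalone GPU wins.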
Time investment
Configuration is straightforward (NVIDIA GPU Operator handles it). Workload migration is the bigger work. A few days per workload class.
#6 — GPU node taints
The smallest of the major fixes. Easy to do; small savings.
Why it is #6
Lesson 1.2 covered this. Pods that should not be on GPU nodes (logging agents, sidecars, occasional CI workloads) are 5-10% of the cost.
What to do
Add the taint:
```yaml
# On the GPU node pool / NodePool
taints:
  - key: nvidia.com/gpu
    value: "true"
    effect: NoSchedule
```
Add tolerations to GPU workloads. Watch for surprises (other DaemonSets you forgot about). Done in an hour.
Time investment
An hour of deploy time, plus a week of monitoring for fallout.
#7 — Long tail
After the top 6, the remaining 5-10% comes from a collection of smaller improvements:
Image optimization
Smaller images = faster pulls = less idle GPU time at startup. A 10 GiB inference image becomes 3 GiB with multi-stage builds and slim base images. 7 GiB less to pull on every cold start. For a service that scales often, this saves 1-3 minutes of idle GPU time per scale-up.
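A quick model of what the slimmer image saves per cold start; the pull bandwidth and GPU price are assumptions:

```python
# Idle-GPU cost of image pulls at cold start.
# Bandwidth and GPU hourly price are assumed; plug in your own.

def pull_cost(image_gib, bandwidth_mib_per_s, gpu_hourly):
    """Returns (pull seconds, dollars of idle GPU time per cold start)."""
    pull_seconds = image_gib * 1024 / bandwidth_mib_per_s
    idle_dollars = pull_seconds / 3600 * gpu_hourly
    return pull_seconds, idle_dollars

before = pull_cost(10, 100, 4.10)  # 10 GiB image at ~100 MiB/s
after = pull_cost(3, 100, 4.10)    # 3 GiB after slimming

print(f"pull time {before[0]:.0f}s -> {after[0]:.0f}s, "
      f"idle cost ${before[1]:.3f} -> ${after[1]:.3f} per cold start")
```

Pennies per cold start, which is why this sits in the long tail: it only adds up for services that scale up many times a day.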
Network optimization
NodeLocal DNSCache (covered in K8s Architecture Module 7.3). VPC endpoints for ECR/S3 to avoid NAT charges. EFA for distributed training to reduce data transfer cost.
Storage optimization
Shared model weights via PVC instead of per-pod copies. EFS / FSx for Lustre for very-large-model serving. ECR pull-through caches.
Workload-specific tuning
Continuous batching for inference. CUDA Graph compilation. Tensor parallelism vs pipeline parallelism choices.
Each of these is a few percentage points. Do them after the top 6.
A 12-month plan
A practical roadmap for a team starting from "no GPU optimization at all":
Q1: autoscaling
- Install GPU-aware metrics.
- Set up HPA on custom metrics.
- Deploy Karpenter (or tune Cluster Autoscaler) for GPU node pools.
- Goal: 30%+ reduction in average GPU node count.
Q2: right-sizing
- Benchmark each workload class on smaller GPUs.
- Migrate workloads where benchmarks pass.
- Goal: another 25-35% reduction by GPU type changes.
Q3: capacity strategy
- Measure steady-state usage.
- Buy Savings Plans / RIs for steady-state.
- Roll out spot for training workloads with checkpointing.
- Goal: another 20-30% reduction via better-priced compute.
Q4: long tail and discipline
- Quantization where applicable.
- MIG where it fits.
- Per-workload cost attribution dashboards.
- Quarterly audit ritual (Module 5.3).
- Goal: 5-10% additional savings + culture change.
End of year: 60-75% reduction from starting point, sustainably. The case study in Module 5.3 ($180K → $67K = 63% reduction) followed roughly this curve.
What might not be worth doing
Some things that look like optimization but rarely justify the engineering cost:
Per-cloud arbitrage
Switching cloud providers because GPUs are slightly cheaper elsewhere. The migration cost almost always exceeds the savings.
Building your own compiler / kernels
Custom CUDA kernels or compiled inference engines. Real savings, but requires deep ML systems expertise. Better to use the open ecosystem (vLLM, TensorRT-LLM) which already optimizes the popular models.
On-prem GPU clusters
Buying your own H100s. The capex is enormous; the operational burden is real. Worth it for very large training operations (millions in monthly compute), almost never for inference.
"Sharing" GPUs without MIG
Running multiple workloads on the same GPU without MIG isolation. It works in research; in production it leads to noisy-neighbor issues, OOMs, and unpredictable performance. Use MIG or dedicate GPUs.
The discipline to apply
The 80/20 only works if you actually do the top items. The patterns that fail:
- Doing #6 first because it is easy. Saves 5%, takes 30 minutes, feels good. Now the team thinks they have "done optimization" and the real wins go untouched.
- Tackling #4 (quantization) because it is intellectually interesting. Engineers love it; the ROI is middling; the harder, less interesting #1 (autoscaling) goes unsolved.
- Trying everything at once. Effort spreads thin and nothing is done well. Pick one optimization per month; finish it; then start the next.
The goal is impact per quarter, not coverage of techniques.
A team I helped showed me their "we are doing GPU optimization" plan. It had 27 items: image size reduction, custom CUDA kernels, MIG configuration, cloud arbitrage proposals, pinning containers to specific cores. After the audit, they had not implemented any of the top 3 optimizations. We tabled the 27-item list and focused on autoscaling for one month. End of month: 35% cost reduction. The 27 items were each worth 1-3 percent; the autoscaling was worth 35. Lesson: optimization is prioritization, not exhaustiveness.
Summary
The 80/20 priority for GPU optimization:
- GPU-aware autoscaling: 30-50% savings; the biggest lever.
- Right-size GPU types: 20-40% savings; per-workload one-time decision.
- Spot for training, savings plans for inference: 30-60% on the capacity it covers.
- Quantization: enables smaller GPUs; workload-specific.
- MIG: powerful for specific multi-tenant cases.
- GPU node taints: small but cheap.
- Long tail: image, network, storage, workload tuning.
A 12-month plan starts with autoscaling (Q1), right-sizing (Q2), capacity strategy (Q3), and discipline (Q4). End-of-year savings of 60-75% are realistic for an unoptimized starting cluster.
The discipline is to apply them in order and finish one before starting the next. Module 1 closes here; Module 2 begins the deep dive on the highest-leverage items, starting with right-sizing GPU types.