Spot H100s Are 70% Cheaper. Most Teams Use Them Wrong and Pay More.
Spot GPUs are the single biggest cost lever you have, and the fastest way to turn a savings story into a reliability incident. The team that runs everything on spot eats a preemption, sees 503s, migrates back to on-demand, and triples the bill without ever asking whether the original setup was wrong. This post covers the real model: what a preemption actually costs, which workloads win on spot and which never should, the per-cloud warning windows, and the roughly 70/30 spot-to-on-demand mix that cuts the bill 40-55% with no SLO hit, provided the drain logic is correct.
Spot H100s are 60-70% cheaper than on-demand. That number is real, it is not a promotional rate, and it is the single biggest line-item lever in any GPU budget. It is also how most teams end up paying more than if they had never touched spot at all.
The arc is always the same. Someone sees the per-hour rate, does the obvious math, and moves the inference fleet to spot to capture the saving. It works for a few weeks. Then the cloud reclaims a node with no graceful handling in place, in-flight requests drop mid-stream, customers see 503s, and three engineers spend an afternoon failing over by hand. The postmortem writes the lesson as "do not use spot for production." The fleet migrates back to all on-demand, the bill triples, and nobody goes back to ask the real question: was the saving wrong, or was the setup wrong?
The setup was wrong. Spot economics are real, but they are not free money — they are a trade of price for a probability of interruption, and capturing the price means engineering for the interruption. This post is the actual model: what a preemption genuinely costs you, which workloads win on spot and which must never run on it, the per-cloud mechanics that decide how much warning you get, and the hybrid baseline-plus-spot mix that production ML platforms actually run.
Why "spot for everything" costs more than on-demand
The naive model treats the price difference as the whole story: spot is 65% cheaper, so spot saves 65%. That is only true if every spot hour you pay for produces useful work. A preemption breaks that assumption in two ways, and both cost real money.
Wasted compute since the last recoverable point. When a node is reclaimed, any work that has not been persisted is gone. For a training job, that is every minute of compute since the last checkpoint. For batch inference, it is the in-flight batch. You paid the (discounted) spot rate for those GPU-hours and got nothing for them — so your effective cost per useful hour is higher than the sticker rate, and how much higher depends entirely on how often you are preempted and how much you lose each time.
The recovery tax. A reclaimed inference pod does not just stop — it has to come back, and a GPU pod's restart is not free. It pays the full cold-start sequence (image pull, weight load into HBM, CUDA graph build, warmup) — the same 30-to-90-second tax the LLM autoscaling post is built around. During that window the replacement consumes a GPU and serves nothing, and the load it was carrying queues elsewhere. Preempt often enough and you spend more GPU-hours recovering than you saved on the rate.
Spot's sticker discount is a price per paid hour. What you actually care about is price per useful hour — paid hours minus the work thrown away on each preemption, plus the GPU-hours burned recovering. The gap between those two numbers is set by your preemption rate and your recovery cost, not by the cloud's price sheet. Run a workload that loses a lot per interruption, or interrupts often, and the effective rate can climb above on-demand. "Spot is cheaper" is a property of the workload, not of the instance.
The cost model that actually decides it
Whether spot wins comes down to four numbers, and you can reason about all of them before you provision anything:
- Preemption rate — how often the cloud reclaims your capacity. Varies wildly by region, instance type, and time of day. Scarce GPU SKUs (H100, H200) in busy regions get reclaimed far more than abundant ones (T4, A10G).
- Loss per preemption — how much useful work evaporates each time. A job that checkpoints every 5 minutes loses at most 5 minutes. A job that never checkpoints loses everything since it started.
- Recovery cost — the GPU-hours and wall-clock spent getting back to serving after an interruption. Dominated by cold start for inference, by re-reading the last checkpoint for training.
- Error budget — how many user-visible failures you can absorb. This is the constraint that turns a cost decision into a reliability decision, and it is the one teams skip.
The mental shorthand: effective cost per useful hour ≈ spot rate ÷ (fraction of paid time that produces durable work). If you checkpoint constantly and recover cheaply, that fraction is near 1.0 and you capture almost the full discount. If a quarter of your paid hours go to lost work and recovery, you pay 0.35 ÷ 0.75 ≈ 0.47 of the on-demand rate, so your "65% off" is really about 53% off, which is still a win. If preemptions are frequent and each one is expensive, the fraction collapses, and once it falls below 0.35 the effective rate exceeds on-demand and spot loses. The whole discipline is pushing that fraction toward 1.0 by making preemptions cheap to absorb, not by avoiding them.
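The shorthand is easy to put into numbers. The sketch below is a rough linear estimate, not a billing model: it normalizes the on-demand rate to 1.0, and the function and parameter names are made up for illustration.

```python
def useful_fraction(preemptions_per_hour, lost_minutes, recovery_minutes):
    """Share of each paid hour left after lost work and recovery (linear estimate)."""
    wasted = preemptions_per_hour * (lost_minutes + recovery_minutes) / 60
    return max(0.0, 1 - wasted)


def effective_rate(spot_discount, fraction, on_demand_rate=1.0):
    """Price per useful hour on spot, relative to the on-demand rate."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    spot_rate = on_demand_rate * (1 - spot_discount)
    return spot_rate / fraction


def spot_wins(spot_discount, fraction):
    """Spot is cheaper only while the useful fraction exceeds (1 - discount)."""
    return fraction > 1 - spot_discount
```

With a 65% discount, `effective_rate(0.65, 0.75)` comes out a little under half the on-demand rate, and `spot_wins(0.65, 0.30)` is false: below a 35% useful fraction the discount is gone.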
Workloads where spot wins
Training with checkpointing. This is the canonical spot workload. Write model state to durable storage every N minutes; a preemption means resuming from the last checkpoint, not from zero. The loss-per-preemption is bounded by your checkpoint interval, the recovery cost is one checkpoint read, and the 65% discount covers the wasted compute many times over. The course's training playbook is built on exactly this: checkpoint frequently enough that the lost work is always small relative to the saving.
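A toy version of that loop shows why the loss is bounded by the interval. This is a sketch under simplifying assumptions: the "model state" is a step counter, the checkpoint is a local JSON file standing in for durable storage, and `stop_at` simulates the preemption. A real job would write framework state to object storage or a shared filesystem.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write atomically so a preemption mid-write never corrupts the last good checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic rename within one filesystem


def load_checkpoint(path):
    """Return the last saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0}


def train(path, total_steps, checkpoint_every, stop_at=None):
    """Resume from the last checkpoint; stop_at simulates the node being reclaimed."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if stop_at is not None and state["step"] >= stop_at:
            return state["step"]  # steps since the last checkpoint are lost
        state["step"] += 1  # stand-in for one real training step
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["step"]
```

Preempting at step 37 with a 10-step interval leaves step 30 on disk, so the restart repeats 7 steps and no more.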
Batch inference. Pull inputs from a queue, write outputs to durable storage, acknowledge only on completion. A preemption means the in-flight batch gets reprocessed by another worker — at-least-once semantics you were probably designing for anyway. Nobody is waiting on a live connection, so there is no SLO to break. Near-ideal for spot.
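The acknowledge-on-completion pattern can be sketched with an in-memory queue. The `Queue` class here is a hypothetical stand-in for a managed queue with a visibility timeout, and the doubling step stands in for inference. The point is only the ordering: write the result, then acknowledge.

```python
from collections import deque


class Queue:
    """In-memory stand-in for a queue with explicit acknowledgement."""

    def __init__(self, items):
        self.pending = deque(items)
        self.in_flight = set()

    def receive(self):
        if not self.pending:
            return None
        item = self.pending.popleft()
        self.in_flight.add(item)
        return item

    def ack(self, item):
        self.in_flight.discard(item)

    def requeue_unacked(self):
        """What a visibility timeout does after a worker disappears."""
        self.pending.extend(self.in_flight)
        self.in_flight.clear()


def run_worker(queue, results, preempt_after=None):
    """Process items, acknowledging only after the output is stored."""
    done = 0
    while (item := queue.receive()) is not None:
        if preempt_after is not None and done >= preempt_after:
            return done  # reclaimed mid-item: nothing written, nothing acked
        results[item] = item * 2  # keyed write, so reprocessing is idempotent
        queue.ack(item)
        done += 1
    return done
```

Because the write is keyed by input, a second worker reprocessing the unacknowledged item produces the same result instead of a duplicate.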
Real-time inference with graceful preemption. This is the case teams get wrong, and it is recoverable. Modern spot offerings give advance warning of reclamation (see the per-cloud table below). That warning is enough for your inference engine to drain: stop admitting new requests, finish the in-flight batch, exit clean — exactly the graceful-shutdown sequence in Draining GPU Nodes Without Losing In-Flight Inference. The preemption signal triggers a cordon-and-drain, the drain runs the preStop-and-grace-period dance, and most reclamations become invisible. Spot for production inference is viable only once that drain logic is correct — it is the precondition, not an optimization.
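The drain itself reduces to an admission gate plus a wait. The sketch below is not any particular inference engine's shutdown code: `DrainableServer` and its methods are hypothetical names, and a real server would also fail its readiness probe so the load balancer stops routing to it.

```python
import signal
import threading


class DrainableServer:
    """Refuse new work once draining, and let in-flight requests finish."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = 0
        self._idle = threading.Event()
        self._idle.set()
        self.draining = False

    def try_admit(self):
        with self._lock:
            if self.draining:
                return False  # caller answers 503 so the client retries elsewhere
            self._in_flight += 1
            self._idle.clear()
            return True

    def finish(self):
        with self._lock:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.set()

    def begin_drain(self):
        self.draining = True  # plain attribute write, safe inside a signal handler

    def wait_idle(self, timeout_s):
        """True if every in-flight request finished before the timeout."""
        with self._lock:
            pass  # let an admission that raced with begin_drain finish registering
        return self._idle.wait(timeout_s)


def install_sigterm_drain(server):
    """SIGTERM is what the kubelet sends at the start of the grace period."""
    signal.signal(signal.SIGTERM, lambda signum, frame: server.begin_drain())
```

The main loop would watch `server.draining`, call `wait_idle` with a timeout shorter than the pod's grace period, and exit. Anything still running when `wait_idle` returns `False` is work the warning window was too short for.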
Multi-zone diversification. Spot capacity and preemption are per-availability-zone: a capacity crunch that reclaims your us-east-1a nodes often leaves 1b and 1c untouched. A fleet spread across 3-5 AZs absorbs most preemption events without dropping below the floor, because they rarely all get hit at once. Diversification is the cheapest reliability you can buy on spot.
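The effect of spreading is simple worst-case arithmetic. The helper below is an illustration with made-up zone names and replica counts. It assumes the largest zones are the ones reclaimed, which is the pessimistic case.

```python
def capacity_after_zone_loss(replicas_per_zone, zones_lost=1):
    """Replicas left if the zones_lost largest zones are reclaimed at once."""
    counts = sorted(replicas_per_zone.values())
    survivors = counts[:-zones_lost] if zones_lost else counts
    return sum(survivors)


single_zone = {"us-east-1a": 12}
three_zones = {"us-east-1a": 4, "us-east-1b": 4, "us-east-1c": 4}
```

Twelve replicas in one zone drop to zero on a single zonal reclaim. The same twelve across three zones keep eight.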
Workloads where on-demand is correct
Spot is a lever, not a religion. Some workloads should never run on it:
- Single-instance production inference with no failover. One pod, one GPU, real users, nothing to fail over to. A preemption is an outage. Either add replicas and drain logic to make it spot-eligible, or keep it on-demand — but do not run a singleton on spot and hope.
- Distributed training without checkpointing. In synchronous data-parallel training, losing one worker can stall or fail the whole step. Without checkpointing, one reclaimed node can cost you the entire run. Add checkpointing first, then revisit spot.
- Anything whose work exceeds the warning window. If your in-flight unit of work — a long-context generation, an agent loop, a batch item — takes longer than the cloud's preemption notice, graceful drain cannot finish it in time, and you are back to dropped work. Match the workload to the window, or stay on-demand.
The deciding question for production inference is not "is spot cheaper" — it is "is my longest in-flight request shorter than my shortest preemption warning?" AWS gives ~2 minutes; GCP and Azure give ~30 seconds. If your p99 generation runs 90 seconds, you can drain gracefully on AWS but not on a 30-second GCP/Azure notice — there, the tail of long requests gets SIGKILLed on every preemption no matter how good your preStop hook is. Spot eligibility is a function of your request-length distribution against the warning window, per cloud. Check it before you assume the discount.
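That comparison is a few lines of code once you have request durations from your serving metrics. The sketch uses the approximate windows quoted above. The `overhead_s` margin for signal propagation is an assumption you would measure yourself, and the function names are illustrative.

```python
WARNING_WINDOW_S = {"aws": 120, "gcp": 30, "azure": 30}  # approximate notices, in seconds


def p99(durations_s):
    """Nearest-rank 99th percentile of request durations."""
    if not durations_s:
        raise ValueError("need at least one sample")
    ordered = sorted(durations_s)
    rank = max(1, (99 * len(ordered) + 99) // 100)  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]


def drainable(durations_s, cloud, overhead_s=5.0):
    """True if the p99 request fits inside the notice minus propagation overhead."""
    return p99(durations_s) <= WARNING_WINDOW_S[cloud] - overhead_s
```

For a fleet whose p99 generation is 90 seconds, `drainable(samples, "aws")` holds and `drainable(samples, "gcp")` does not, which is the per-cloud split described above.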
Per-cloud preemption mechanics
The warning window is the variable that decides whether real-time inference can run on spot, and it differs sharply by provider. Approximate 2026 behavior:
| Provider | Spot product | Reclamation notice | Notes |
|---|---|---|---|
| AWS | EC2 Spot | ~2 minutes | Plus earlier Capacity Rebalance recommendations you can act on before the hard notice. The most drain-friendly window. |
| GCP | Spot VMs | ~30 seconds | Short. Long requests will not finish inside it; size workloads accordingly. |
| Azure | Spot VMs | ~30 seconds | Similar to GCP. Eviction can be capacity- or price-based. |
| CoreWeave / GPU clouds | Varies | Provider-specific | Specialized GPU clouds often use reserved/committed models instead of classic spot; read the actual interruption SLA. |
Two practical consequences. First, the same workload can be spot-eligible on AWS and spot-ineligible on GCP/Azure purely because of the window — your placement policy has to be per-cloud, not global. Second, act on the early signal, not just the hard notice. AWS Capacity Rebalance recommendations arrive before the 2-minute clock; wiring your node-termination handler to cordon and pre-drain on the recommendation buys you margin the hard notice does not.
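On AWS both signals are readable from the instance metadata service, which is what a node-termination handler polls on your behalf. The sketch below shows the two paths and one way to map them to actions. The action names `cordon_and_predrain` and `drain_now` are this example's own labels, and a production setup would normally rely on an existing handler rather than hand-rolled polling.

```python
import json
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"


def imds_get(path, timeout_s=1.0):
    """Fetch an IMDSv2 metadata path; None when it returns 404 (no notice yet)."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    token = urllib.request.urlopen(token_req, timeout=timeout_s).read().decode()
    req = urllib.request.Request(
        f"{IMDS}/meta-data/{path}", headers={"X-aws-ec2-metadata-token": token}
    )
    try:
        return urllib.request.urlopen(req, timeout=timeout_s).read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


def next_action(rebalance_body, interruption_body):
    """The hard notice wins; otherwise act early on the rebalance recommendation."""
    if interruption_body is not None:
        return ("drain_now", json.loads(interruption_body).get("time"))
    if rebalance_body is not None:
        return ("cordon_and_predrain", json.loads(rebalance_body).get("noticeTime"))
    return ("none", None)


def poll_once(fetch=imds_get):
    return next_action(
        fetch("events/recommendations/rebalance"),
        fetch("spot/instance-action"),
    )
```

Passing `fetch` as a parameter keeps the decision logic testable off-instance. On a real node you would call `poll_once()` every few seconds.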
The hybrid pattern: on-demand floor, spot ceiling
The mix nearly every production ML platform converges on: a small on-demand baseline that guarantees the floor, and spot capacity that scales on top of it. Run roughly 70-80% spot, 20-30% on-demand. On-demand handles your minimum always-on capacity — the replicas that must exist for the SLO even if every spot node in the region vanishes at once. Spot absorbs everything above that floor: the diurnal peaks, the batch surges, the elastic demand. Done right, total spend drops 40-55% versus all-on-demand with no SLO impact, because the on-demand floor catches the worst case while spot captures the discount on the bulk.
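The 40-55% figure is roughly what the arithmetic gives. A quick check, treating the on-demand rate as 1.0 and assuming capacity share translates directly into spend share (the `useful_fraction` parameter is this sketch's way of folding in lost work and recovery):

```python
def blended_saving(spot_share, spot_discount, useful_fraction=1.0):
    """Fractional saving versus all on-demand for a fleet with spot_share on spot."""
    spot_cost = spot_share * (1 - spot_discount) / useful_fraction
    on_demand_cost = 1 - spot_share
    return 1 - (spot_cost + on_demand_cost)
```

At 70% spot and a 60% discount the saving is 42%. At 80% spot and a 70% discount it is 56%. Any lost work or recovery time pulls both numbers down, which is why the realized range sits slightly lower.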
With Karpenter, you express this as two NodePools — an on-demand baseline weighted to be chosen first up to a cap, and a spot pool for the rest:
```yaml
# Baseline: on-demand, preferred, capped at the SLO floor
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-baseline-ondemand
spec:
  weight: 100                       # prefer this pool first
  limits:
    nvidia.com/gpu: "8"             # only enough on-demand for the floor; one p5.48xlarge has 8 GPUs, so use a multiple of 8
  template:
    spec:
      nodeClassRef:                 # required in v1; "default" is an assumed EC2NodeClass name
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["p5.48xlarge"]   # H100
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
---
# Ceiling: spot, fills everything above the floor, diversified across types/AZs
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-burst-spot
spec:
  weight: 10                        # only used after the baseline cap is hit
  limits:
    nvidia.com/gpu: "40"
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    budgets:
      - nodes: "20%"                # never disrupt more than 20% at once
  template:
    spec:
      nodeClassRef:                 # same assumed EC2NodeClass as the baseline pool
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["p5.48xlarge", "p5e.48xlarge"]   # diversify across SKUs
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
```
The weight ordering makes Karpenter fill the cheap-to-reason-about on-demand floor first (up to its limits), then spill to spot for the burst. The spot pool's disruption budgets cap how much capacity churns at once during consolidation, and listing multiple instance types diversifies away from a single SKU's preemption rate. The whole thing only holds together if the pods themselves drain gracefully — which is why the graceful-drain pattern is the load-bearing dependency under any spot strategy, not a separate concern.
Set the on-demand baseline to your true SLO floor under simultaneous regional spot loss, not your average load. The question to size it against is: "if every spot node disappeared in the next 30 seconds, how much capacity must already be running on-demand for me to still meet my SLO?" That number — usually 20-30% of peak — is your floor. Everything above it is fair game for spot. Size the floor too low and a correlated preemption breaches the SLO; too high and you are leaving the discount on the table.
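Turning that question into a replica count is one division. The sketch below uses requests per second as the capacity unit, which is an assumption: substitute tokens per second or concurrent sequences if that is how you measure a replica. All parameter names are illustrative.

```python
import math


def on_demand_floor(slo_min_rps, rps_per_replica):
    """Replicas that must already be on-demand to hold the SLO with zero spot nodes."""
    return math.ceil(slo_min_rps / rps_per_replica)


def floor_share(floor_replicas, peak_replicas):
    """Floor as a fraction of peak, for comparison with the 20-30% rule of thumb."""
    return floor_replicas / peak_replicas
```

If the SLO needs 250 requests per second to survive a total spot loss and one replica sustains 100, the floor is 3 replicas. Against a 12-replica peak that is 25%, inside the usual range.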
Reserved and committed capacity: the third axis
Spot vs on-demand is not the whole picture. For the predictable part of your baseline — the floor that runs 24/7 regardless of demand — neither spot nor on-demand is the cheapest option: reserved instances or savings plans are. The full hierarchy production platforms run is three-tier: commit (reserved/savings plans) for the always-on floor, on-demand for the unpredictable-but-must-not-fail middle, and spot for the elastic, interruption-tolerant top. Commitments trade flexibility for a discount comparable to spot without the interruption risk — but only pay off if you actually run that capacity for the commitment term. The sizing math (how much to commit, for how long, against a usage floor you can defend) is its own discipline, covered in the course alongside the spot mechanics here.
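The "only pay off if you actually run that capacity" condition has a clean break-even. A commitment bills every hour of the term at the discounted rate, while on-demand bills only the hours used, so the commitment wins once utilization exceeds one minus the discount. The sketch normalizes the on-demand rate to 1.0, and the 40% discount in the example is illustrative, not a quoted price.

```python
def break_even_utilization(commit_discount):
    """Share of the term the capacity must be in use for a commitment to beat on-demand."""
    return 1 - commit_discount


def cheapest_for_floor(commit_discount, expected_utilization):
    committed_cost = 1 - commit_discount   # paid for every hour of the term
    on_demand_cost = expected_utilization  # paid only for the hours used
    return "commit" if committed_cost < on_demand_cost else "on-demand"
```

With a 40% commitment discount, capacity that runs around the clock belongs on the commitment, while capacity used only half the time is cheaper on-demand (0.6 versus 0.5 of the full-time on-demand cost).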
Common mistakes
Treating the sticker discount as the realized saving. 65% off the rate is not 65% off the bill unless every paid hour produces durable work. Model effective cost per useful hour, including lost work and recovery.
Spot for everything, drain logic for nothing. The classic failure: capture the rate, skip the engineering, eat an incident on the first preemption, then over-correct to all on-demand. Graceful drain is the precondition for production spot, not a nice-to-have.
Ignoring the warning-window-vs-request-length match. A 90-second request cannot drain inside a 30-second GCP/Azure notice. Spot eligibility for real-time inference is per-cloud and per-workload, decided by this comparison.
No on-demand floor. All-spot means a correlated regional preemption can take your whole fleet below SLO at once. Always keep an on-demand (or reserved) baseline sized for the worst case.
Single instance type, single AZ. Concentrating spot demand on one SKU in one zone maximizes your preemption rate. Diversify across instance types and 3-5 AZs so reclamations are uncorrelated.
Running a production singleton on spot. One pod, one GPU, no failover — a preemption is guaranteed downtime. Add replicas and drain logic, or keep it on-demand.
No checkpointing on spot training. Without it, every preemption costs the entire run-so-far. Checkpoint frequently enough that lost work is always small next to the saving — that is what makes training the best spot workload there is.
Forgetting reserved capacity for the floor. The always-on baseline is cheapest on a commitment, not on-demand. Spot is for the elastic top of the stack, not the predictable bottom.
The mental model
On-demand is renting capacity with a guarantee; spot is renting the same capacity without the guarantee, at a discount that prices in the probability it gets taken back. That framing makes every decision fall out cleanly. The discount is only worth taking when you can make an interruption cheap — by checkpointing, by draining gracefully, by diversifying so reclamations are uncorrelated, by keeping a guaranteed floor underneath. Do that work and spot is the single largest lever in your GPU budget, routinely cutting the bill in half. Skip it and spot is a discount that bills you back as incidents.
The teams that fail with spot are not wrong about the price — they are right about the price and wrong about the assumption that the price is the whole transaction. The price buys you the GPU; the engineering buys you the right to keep the discount when the cloud asks for the GPU back. Workload-aware placement — spot where interruption is cheap, on-demand for the floor, reserved for the predictable base, and graceful drain underneath all of it — is not a tuning detail. It is the difference between spot being your biggest saving and spot being your next postmortem.
The full GPU cost model — the true cost of a GPU-hour, the three forms of waste, right-sizing GPU types, spot and reserved-capacity strategy, the checkpointing playbook for spot training, and per-workload cost attribution with Kubecost/OpenCost — is the GPU Cost Optimization course. The GPU foundations and node lifecycle beneath spot placement are the Production GPU Infrastructure course, and the inference-serving patterns that have to survive preemption are the LLM Inference on Kubernetes course. Related reading: Draining GPU Nodes Without Losing In-Flight Inference for the graceful-shutdown logic that makes production spot possible, Your HPA Scales LLM Pods on CPU for the warm-buffer and node-loop strategy that absorbs preemptions, and MIG vs Time-Slicing for the other big GPU cost lever — packing more workloads onto the hardware you already pay for.