The DevOpsBeast Blog

Production engineering notes.

Field notes on Kubernetes, GPUs, Linux, and the rest of the production stack, from engineers who run real infrastructure.

GPU Cost Optimization · 14 min read

Spot H100s Are 70% Cheaper. Most Teams Use Them Wrong and Pay More.

Spot GPUs are the single biggest cost lever you have, and the fastest way to turn a savings story into a reliability incident. The team that runs everything on spot eats a preemption, sees 503s, migrates back to on-demand, and triples the bill without ever asking whether the original setup was wrong. This post lays out the real model: what a preemption actually costs, which workloads win on spot and which never should, the per-cloud warning windows, and the 70/30 baseline-plus-spot mix that cuts the bill 40-55% with no SLO hit, provided the drain logic is correct.
