The DevOpsBeast Blog

Production engineering notes.

Field notes on Kubernetes, GPUs, Linux, and the rest of the production stack, from engineers who run real infrastructure.

LLM Infrastructure · 13 min read

You Changed the Prompt. Is the Model Better or Worse? You Don't Have a Test That Tells You.

Operating an LLM in production is not MLOps and it is not traditional ops. It is running a non-deterministic component with no ground-truth notion of 'correct,' where a one-line prompt edit is a deploy and there is no green checkmark that says it's safe to ship. The operational surface that replaces the unit test: evals as your test suite, prompts and model versions as deployable config, RAG freshness, observability for systems that have no 'wrong answer' to alert on, and rollout you can't fully validate offline.

LLM Infrastructure · 16 min read

kubectl drain Killed a 90-Second Inference Request. Stateless Drain Logic Doesn't Work for GPU Pods.

Draining a GPU node in the middle of a long inference request is how you teach your users what 503 looks like. A stateless pod evicts in seconds; a vLLM pod has a minute of cold start and requests in flight for two. The three things a production drain needs — a real grace period, a preStop that drains the engine, and a readiness gate that fails the instant drain starts — plus why the same pattern is load-bearing for spot preemption, autoscaler downscaling, and rolling upgrades.

Kubernetes Security · 13 min read

A Pod in Your Cluster Just Got Compromised. Walk Me Through the Blast Radius.

One container gets popped — an RCE in an app, a malicious dependency, a leaked token. The junior answer is 'kill the pod.' The senior answer traces the blast radius: from the mounted ServiceAccount token to the API server, across a flat pod network to the cloud metadata endpoint, and through a privileged pod to the node and every secret on it. The attacker's path layer by layer, and the single control that caps the damage at each one — the difference between 'one pod' and 'whole cluster.'

DevOpsBeast · 11 min read

Most Courses Teach Tools. Senior DevOps Interviews Test Architecture. Here's the Gap.

After 50+ senior DevOps interviews on both sides of the table, the same pattern keeps repeating: courses teach tools, interviews test architecture, and strong operators freeze the moment a question turns from 'what does this do' to 'design this and defend it.' The five reasoning questions senior candidates actually fail, what a knowledge answer looks like versus a reasoning answer, and how to close the gap.

GPU Cost Optimization · 14 min read

Spot H100s Are 70% Cheaper. Most Teams Use Them Wrong and Pay More.

Spot GPUs are the single biggest cost lever you have — and the fastest way to turn a savings story into a reliability incident. The team that runs everything on spot eats a preemption, sees 503s, migrates back to on-demand, and triples the bill without ever asking whether the original setup was wrong. The real model: what a preemption actually costs, which workloads win on spot and which never should, the per-cloud warning windows, and the 70/30 baseline-plus-spot mix that cuts the bill 40-55% with no SLO hit — if the drain logic is correct.

LLM Infrastructure · 14 min read

Your GPU Finishes a Request and Waits for the Slowest. Continuous Batching Is the Fix.

Static batching pads every request to the length of the longest one in the batch. Short requests finish and their GPU slots sit idle, burning money, until the whole batch drains. Continuous batching schedules at the granularity of a single token instead of a whole request — and it is the single biggest reason vLLM is 5x faster than naive serving. Here is exactly how it works.
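
As a rough illustration of that difference, here is a toy simulation that counts decode steps under each policy. It is a sketch, not vLLM's scheduler: the function names, the four-slot batch, and the request lengths are all made up for the example, and it ignores prefill and memory limits.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Continuous batching: after every token step, finished requests
    free their slot and a waiting request is admitted right away."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Fill any free slots before the next token step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every active request emits one token; drop the finished ones.
        running = [r - 1 for r in running if r - 1 > 0]
    return steps

# Hypothetical workload: two long requests mixed with six short ones.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps)      # 200
print(continuous_steps)  # 105
```

With this invented workload the static schedule needs 200 steps because each batch waits on its 100-token request, while the continuous schedule finishes in 105 because short requests hand their slots to waiting work as soon as they complete.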

GPU Infrastructure · 17 min read

Your GPU Dashboard Says 100% Utilized. It's Lying. Welcome to DCGM.

Every post about GPU incidents starts with 'the dashboards looked fine.' That's the problem. nvidia-smi GPU utilization tells you a kernel ran — not whether the silicon is doing work. The metrics that actually matter, the DCGM + Prometheus stack that exposes them, and the queries and alerts that catch real GPU failures.

LLM Infrastructure · 16 min read

Your HPA Scales LLM Pods on CPU. They're Either Idle or On Fire.

The default Kubernetes autoscaler watches CPU. Your GPU sits at 100% no matter what. So your inference fleet either never scales or scales 90 seconds too late, after the cold start, after the SLO already broke. The signals that actually predict load, the KEDA wiring, and the cold-start tax that makes reactive scaling a trap.

LLM Infrastructure · 16 min read

Your LLM Bill Tripled and Traffic Didn't. Welcome to Prompt Economics.

The unit of cost in an LLM system is the token, and almost nobody is counting them. Output tokens cost 3-5x input. Your context window is 80% dead weight. This is the cost-per-request math, where the tokens actually go, and the levers that bend the curve — in ROI order.
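
A minimal sketch of the cost-per-request arithmetic, under stated assumptions: the $1 per million input tokens price is a placeholder rather than any provider's rate, the 4x output multiplier is simply a point inside the 3-5x range, and the function name and token counts are hypothetical.

```python
def request_cost_usd(input_tokens, output_tokens,
                     input_usd_per_million=1.0, output_multiplier=4.0):
    """Cost of one request when output tokens are priced at a multiple
    of the input-token price (both values here are assumptions)."""
    output_usd_per_million = input_usd_per_million * output_multiplier
    return (input_tokens * input_usd_per_million
            + output_tokens * output_usd_per_million) / 1_000_000

# Hypothetical request: 4,000 tokens of context, 500 tokens of answer.
baseline = request_cost_usd(input_tokens=4000, output_tokens=500)
# Same answer, with 80% of the context trimmed away.
trimmed = request_cost_usd(input_tokens=800, output_tokens=500)
print(f"baseline: ${baseline:.4f}  trimmed context: ${trimmed:.4f}")
```

In this made-up case, trimming 80% of the input tokens takes the request from $0.0060 to $0.0028, after which the 500 output tokens account for most of what remains.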
