The DevOpsBeast Blog

Production engineering notes.

Field notes on Kubernetes, GPUs, Linux, and the rest of the production stack, from engineers who run real infrastructure.

All (42) · Container Internals (1) · Container Security (1) · DevOpsBeast (2) · GPU (1) · GPU Cost Optimization (1) · GPU Infrastructure (3) · Kubernetes (1) · Kubernetes Architecture (1) · Kubernetes Debugging (4) · Kubernetes Networking (2) · Kubernetes Operations (2) · Kubernetes Performance (1) · Kubernetes Security (2) · Linux (2) · LLM Infrastructure (7) · Networking (2) · Observability (1) · Security (8)

LLM Infrastructure·Jun 10, 2026·13 min read

You Changed the Prompt. Is the Model Better or Worse? You Don't Have a Test That Tells You.

Operating an LLM in production is not MLOps and it is not traditional ops. It is running a non-deterministic component with no ground-truth notion of 'correct,' where a one-line prompt edit is a deploy and there is no green checkmark that says it's safe to ship. The operational surface that replaces the unit test: evals as your test suite, prompts and model versions as deployable config, RAG freshness, observability for systems that have no 'wrong answer' to alert on, and rollout you can't fully validate offline.

LLM Infrastructure·Jun 9, 2026·16 min read

kubectl drain Killed a 90-Second Inference Request. Stateless Drain Logic Doesn't Work for GPU Pods.

Draining a GPU node in the middle of a long inference request is how you teach your users what 503 looks like. A stateless pod evicts in seconds; a vLLM pod has a minute of cold start and requests that stay in flight for two minutes. The three things a production drain needs (a real grace period, a preStop that drains the engine, and a readiness gate that fails the instant drain starts), plus why the same pattern is load-bearing for spot preemption, autoscaler downscaling, and rolling upgrades.

LLM Infrastructure·May 31, 2026·14 min read

Your GPU Finishes a Request and Waits for the Slowest. Continuous Batching Is the Fix.

Static batching pads every request to the length of the longest one in the batch. Short requests finish and their GPU slots sit idle, burning money, until the whole batch drains. Continuous batching schedules at the granularity of a single token instead of a whole request — and it is the single biggest reason vLLM is 5x faster than naive serving. Here is exactly how it works.
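
To make the idle-slot problem concrete, here is a toy step counter, not vLLM's scheduler. It assumes every decode step costs the same, ignores prefill and KV-cache memory limits, and uses made-up output lengths; the function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # Every slot in the batch is held for as long as the longest request.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled from the queue at every token step."""
    queue = deque(lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while queue or running:
        # Admit waiting requests into any free slots before this step.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        # Each in-flight request emits one token; finished requests free their slot.
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


# Hypothetical workload: two long requests mixed in with six short ones.
output_lengths = [100, 10, 10, 10, 100, 10, 10, 10]
static_steps = static_batching_steps(output_lengths, batch_size=4)
continuous_steps = continuous_batching_steps(output_lengths, batch_size=4)
print(static_steps, continuous_steps)
```

On this workload the static schedule takes 200 steps and the continuous one takes 110, because the second long request starts as soon as a short one frees a slot instead of waiting for the first batch to drain. The ratio depends entirely on the length mix, so treat it as an illustration rather than a benchmark.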

LLM Infrastructure·May 30, 2026·16 min read

Your HPA Scales LLM Pods on CPU. They're Either Idle or On Fire.

The default Kubernetes autoscaler watches CPU. Your GPU sits at 100% no matter what. So your inference fleet either never scales or scales 90 seconds too late, after the cold start, after the SLO already broke. The signals that actually predict load, the KEDA wiring, and the cold-start tax that makes reactive scaling a trap.

LLM Infrastructure·May 28, 2026·16 min read

Your LLM Bill Tripled and Traffic Didn't. Welcome to Prompt Economics.

The unit of cost in an LLM system is the token, and almost nobody is counting them. Output tokens cost 3-5x input. Your context window is 80% dead weight. This is the cost-per-request math, where the tokens actually go, and the levers that bend the curve — in ROI order.
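
A back-of-the-envelope sketch of the cost-per-request arithmetic. The prices, token counts, and the `cost_per_request` helper are hypothetical placeholders, not any provider's rate card; the only figure taken from the teaser is that output tokens are priced at a multiple of input tokens.

```python
def cost_per_request(input_tokens, output_tokens, input_price_per_million, output_multiplier):
    """Dollar cost of one request when output tokens cost a multiple of input tokens."""
    output_price_per_million = input_price_per_million * output_multiplier
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000


# Assumed numbers: $1.00 per million input tokens, output priced at 4x input.
baseline = cost_per_request(4000, 500, input_price_per_million=1.0, output_multiplier=4)

# Same request with the prompt trimmed to 1,000 tokens of context.
trimmed = cost_per_request(1000, 500, input_price_per_million=1.0, output_multiplier=4)

print(baseline, trimmed)
```

With these assumed numbers the 500 output tokens cost half as much as the 4,000 input tokens, and trimming the context to 1,000 tokens halves the request cost. Plug in your own token counts and prices before drawing conclusions.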

LLM Infrastructure·May 20, 2026·13 min read

Your Team Is Debating vLLM vs SGLang. The Performance Numbers Are Not the Decision.

Both engines hit similar throughput on similar hardware in 2026. The decision is workload shape (agents vs chat vs RAG), structured output needs, and operational maturity. Here is the honest production comparison.

LLM Infrastructure·May 19, 2026·20 min read

Your LLM Cluster Is at 90% HBM and 60% Is KV Cache. Welcome to the Disaggregation Cliff.

vLLM prefix caching is great. It stops at one node. When your fleet of 50 H100s is bottlenecked on KV cache and adding GPUs is not financially viable, the next architecture is disaggregated KV cache. Here is the wall, the math, Mooncake, and what to actually do on Monday.
