Operating an LLM in production is not MLOps and it is not traditional ops. It is running a non-deterministic component with no ground-truth notion of 'correct,' where a one-line prompt edit is a deploy and there is no green checkmark that says it's safe to ship. The operational surface that replaces the unit test: evals as your test suite, prompts and model versions as deployable config, RAG freshness, observability for systems that have no 'wrong answer' to alert on, and rollout you can't fully validate offline.
Draining a GPU node in the middle of a long inference request is how you teach your users what 503 looks like. A stateless pod evicts in seconds; a vLLM pod has a minute of cold start and requests in flight for two. The three things a production drain needs — a real grace period, a preStop that drains the engine, and a readiness gate that fails the instant drain starts — plus why the same pattern is load-bearing for spot preemption, autoscaler downscaling, and rolling upgrades.
One container gets popped — an RCE in an app, a malicious dependency, a leaked token. The junior answer is 'kill the pod.' The senior answer traces the blast radius: from the mounted ServiceAccount token to the API server, across a flat pod network to the cloud metadata endpoint, and through a privileged pod to the node and every secret on it. The attacker's path layer by layer, and the single control that caps the damage at each one — the difference between 'one pod' and 'whole cluster.'
After 50+ senior DevOps interviews on both sides of the table, the same pattern keeps repeating: courses teach tools, interviews test architecture, and strong operators freeze the moment a question turns from 'what does this do' to 'design this and defend it.' The five reasoning questions senior candidates actually fail, what a knowledge answer looks like versus a reasoning answer, and how to close the gap.
Spot GPUs are the single biggest cost lever you have — and the fastest way to turn a savings story into a reliability incident. The team that runs everything on spot eats a preemption, sees 503s, migrates back to on-demand, and triples the bill without ever asking whether the original setup was wrong. The real model: what a preemption actually costs, which workloads win on spot and which never should, the per-cloud warning windows, and the 70/30 baseline-plus-spot mix that cuts the bill 40-55% with no SLO hit — if the drain logic is correct.
Static batching pads every request to the length of the longest one in the batch. Short requests finish and their GPU slots sit idle, burning money, until the whole batch drains. Continuous batching schedules at the granularity of a single token instead of a whole request — and it is the single biggest reason vLLM is 5x faster than naive serving. Here is exactly how it works.
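The gap between the two schedulers can be sketched with a toy simulation. This is a simplified model, not vLLM's scheduler: it assumes every request needs a fixed number of decode steps, the GPU has a fixed number of slots, and prefill cost is ignored. The function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batching_steps(lengths, slots):
    """Decode steps when a freed slot is refilled before the next step."""
    queue = deque(lengths)
    running = []  # remaining tokens for each active request
    steps = 0
    while queue or running:
        # Admit waiting requests into any free slots.
        while queue and len(running) < slots:
            running.append(queue.popleft())
        steps += 1  # one decode step emits one token per active request
        # Requests that just emitted their last token leave the batch.
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [100, 5, 5, 5, 100, 5, 5, 5]  # output tokens per request
print(static_batching_steps(lengths, slots=4))      # 200
print(continuous_batching_steps(lengths, slots=4))  # 105
```

In this toy workload the short requests no longer wait for the long one, so the second long request starts at step 6 instead of step 101.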
Every post about GPU incidents starts with 'the dashboards looked fine.' That's the problem. nvidia-smi GPU utilization tells you a kernel ran — not whether the silicon is doing work. The metrics that actually matter, the DCGM + Prometheus stack that exposes them, and the queries and alerts that catch real GPU failures.
The default Kubernetes autoscaler watches CPU. Your GPU sits at 100% no matter what. So your inference fleet either never scales or scales 90 seconds too late, after the cold start, after the SLO already broke. The signals that actually predict load, the KEDA wiring, and the cold-start tax that makes reactive scaling a trap.
The unit of cost in an LLM system is the token, and almost nobody is counting them. Output tokens cost 3-5x input. Your context window is 80% dead weight. This is the cost-per-request math, where the tokens actually go, and the levers that bend the curve — in ROI order.
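As a rough illustration of the cost-per-request math, here is a minimal sketch. The prices are hypothetical placeholders (output priced at 4x input, inside the 3-5x range above), and the token counts are invented for the example.

```python
def request_cost_usd(input_tokens, output_tokens,
                     input_price_per_m, output_price_per_m):
    """Cost of one request given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Hypothetical prices in USD per million tokens; not any vendor's price list.
INPUT_PRICE = 1.0
OUTPUT_PRICE = 4.0

baseline = request_cost_usd(8_000, 500, INPUT_PRICE, OUTPUT_PRICE)
# Same request after trimming 80% of the context as dead weight.
trimmed = request_cost_usd(1_600, 500, INPUT_PRICE, OUTPUT_PRICE)

print(f"baseline ${baseline:.4f}, trimmed ${trimmed:.4f}")
```

Under these assumed numbers the input side is 80% of the baseline cost even though output tokens are priced 4x higher, which is why context trimming tends to rank high among the levers.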
MIG is hardware partitioning. Time-slicing is software multiplexing. They are not interchangeable. The production decision walk-through, the H100 profile math, the GPU Operator config, and the migration path most teams hit.
Both engines hit similar throughput on similar hardware in 2026. The decision is workload shape (agents vs chat vs RAG), structured output needs, and operational maturity. Here is the honest production comparison.
vLLM prefix caching is great. It stops at one node. When your fleet of 50 H100s is bottlenecked on KV cache and adding GPUs is not financially viable, the next architecture is disaggregated KV cache. Here is the wall, the math, Mooncake, and what to actually do on Monday.
DevOpsBeast is not for everyone. It is built for one specific kind of engineer with one specific problem. Here is who it is for, who it is not for, and why I built it.
JWTs were designed for short-lived authorization assertions. Half the industry uses them as session cookies, then discovers they cannot revoke. The five problems this causes and the right alternative.
MFA fatigue is the cheapest, most effective attack against push-based MFA in 2026. The defense is one IdP config change. Here is the attack, the defense, and why most companies still have not enabled it.
PKCE used to be a mobile-only thing. OAuth 2.1 makes it mandatory for everyone. Here is what the protection actually does, why a confidential web app needs it too, and the eight-line implementation that closes the authorization-code-interception attack.
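A minimal sketch of the client side of PKCE with the S256 method from RFC 7636, using only the Python standard library. This is an illustration, not the implementation from the post; the function names are hypothetical.

```python
import base64
import hashlib
import secrets


def make_code_verifier(n_bytes=32):
    """Random URL-safe verifier; 32 random bytes encode to 43 characters."""
    raw = secrets.token_bytes(n_bytes)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def code_challenge_s256(verifier):
    """BASE64URL(SHA256(verifier)) with the padding stripped."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


verifier = make_code_verifier()
challenge = code_challenge_s256(verifier)
# The client sends `challenge` with code_challenge_method=S256 on the
# authorization request, and `verifier` on the token request.
```

An attacker who intercepts the authorization code still lacks the verifier, so the token endpoint rejects the exchange.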
Refresh-token rotation is a known good practice. The 'reuse detection' that goes with it is what actually catches stolen tokens. Here is how the mechanism works and how to implement it correctly.
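The mechanism can be sketched with an in-memory store. This is a simplified model under stated assumptions: tokens are grouped into a "family" per login, a used token is kept so its replay can be recognized, and there is no expiry or persistence. The class and method names are hypothetical.

```python
import secrets


class RefreshTokenStore:
    """Toy rotation store: each refresh token is single-use."""

    def __init__(self):
        self._tokens = {}  # token -> {"family": str, "used": bool}
        self._revoked_families = set()

    def issue(self, family=None):
        """Issue a token, starting a new family unless one is given."""
        family = family or secrets.token_hex(8)
        token = secrets.token_urlsafe(32)
        self._tokens[token] = {"family": family, "used": False}
        return token

    def rotate(self, token):
        """Exchange a token for its successor, or revoke the family on reuse."""
        record = self._tokens.get(token)
        if record is None or record["family"] in self._revoked_families:
            raise PermissionError("invalid or revoked refresh token")
        if record["used"]:
            # A used token was presented again: either the client or an
            # attacker holds a stolen copy, so the whole family is revoked.
            self._revoked_families.add(record["family"])
            raise PermissionError("refresh token reuse detected")
        record["used"] = True
        return self.issue(record["family"])
```

Revoking the whole family matters because the server cannot tell which party replayed the old token; both the legitimate client and the thief are forced back through login.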
AssumeRoleWithWebIdentity returns AccessDenied. The OIDC token looks valid. The trust policy looks right. The error message is useless. Eight specific causes, eight specific fixes, and a diagnostic that finds the right one in 30 seconds.
The model weights are 16GB. The KV cache is 20GB. The A100 has 80GB. nvidia-smi shows 50GB free. The next request OOMs. The CUDA memory allocator's fragmentation story most ML engineers never learn.
Liveness probes that fire before your app is ready. Readiness probes that check the database. Exec probes leaking zombie processes by the thousands. The six mistakes that turn health checks into the cause of the outage they were supposed to prevent.
Drains hang forever when a PodDisruptionBudget can never be satisfied. The four trap configurations, how to diagnose which one is biting, and the right PDB design that does not break node maintenance.
Two unrelated kernel limits bite high-throughput Kubernetes services: ephemeral port exhaustion from TIME_WAIT and conntrack table overflow. Same symptom, different root causes, different fixes.
Every Service you create makes kube-proxy rewrite your entire iptables ruleset. At small scale you never notice. At 5,000 Services kube-proxy is at 100% CPU, Service updates take 30 seconds, and your latency p99 is in the seconds. Here is the cliff and how to avoid falling off it.
A node runs out of memory. The kernel and the kubelet both pick which pod to kill. Neither of them picks the leaky one. They pick the well-behaved BestEffort pod next door. The QoS, oom_score_adj, and eviction-priority story most engineers never learn.
The certificate in the Secret is fresh. The pod is still serving the expired one. cert-manager did its job. Your app did not. The five renewal failures that bite production.
Your API server latency p99 is rising. etcd disk usage is creeping toward the 2GB quota. Compaction has run, defrag has not, and your cluster is one write spike away from a NOSPACE alarm that leaves etcd rejecting writes.
Your service runs fine for weeks, then suddenly fails with 'too many open files' under load. Three layers of fd limits, why the wrong one bites first, and how to set them so this stops happening.
gRPC connections silently die behind load balancers, NAT gateways, and idle timeouts. The keepalive settings that prevent this are documented separately on every side and you need all four to agree.
JWTs look simple: a signed JSON blob, verify the signature, trust the claims. Almost every step of that has a known bug pattern that has caused real production breaches. Here is the catalog.
Kubernetes certificates expire silently. No warning, no alert, no graceful degradation, just a dead cluster. Here is how to fix it in five minutes and how to make sure it never happens again.
CPU limits in Kubernetes do not mean what you think they mean. A tour of CFS quota, the 100ms scheduling period, and why your latency spikes look nothing like CPU saturation.
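A small worked example of the quota arithmetic, assuming the default 100ms CFS period and threads that run continuously until the quota is gone. The function name is hypothetical, and the model ignores scheduler details such as per-CPU quota slices.

```python
def cfs_throttle_ms(cpu_limit, busy_threads, period_ms=100.0):
    """Return (wall-clock ms running, ms throttled) within one CFS period."""
    quota_ms = cpu_limit * period_ms  # CPU time allowed per period
    # N busy threads consume quota N times faster than wall-clock time.
    run_wall_ms = min(period_ms, quota_ms / busy_threads)
    return run_wall_ms, period_ms - run_wall_ms


print(cfs_throttle_ms(cpu_limit=1.0, busy_threads=4))  # (25.0, 75.0)
```

With a limit of 1 CPU and four busy threads, the container burns its 100ms of quota in 25ms of wall time and then sits throttled for the remaining 75ms, while average usage still reads as one core.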
ndots:5 is the silent latency killer in Kubernetes. Every external hostname lookup walks the pod's search list first, typically four wasted queries before the right one. Here is why, and how to fix it.
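The search-list expansion can be sketched as follows. The search domains are an assumption for illustration: three cluster domains for a pod in the `default` namespace plus one hypothetical node-level domain. The function models only the ordering rule from resolv.conf, not a real resolver.

```python
def candidate_queries(name, search, ndots=5):
    """Names a stub resolver would try, in order, until one resolves."""
    if name.endswith("."):
        return [name]  # already fully qualified: no search list
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = name + "."
    if name.count(".") >= ndots:
        return [absolute] + expanded  # enough dots: try as-is first
    return expanded + [absolute]      # too few dots: search list first


# Assumed search list; the last entry is a hypothetical node domain.
SEARCH = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "corp.example",
]

for query in candidate_queries("api.example.com", SEARCH):
    print(query)
```

`api.example.com` has two dots, fewer than five, so four search-list names are tried and fail before the real one. Writing the name with a trailing dot, or lowering `ndots` in the pod's `dnsConfig`, skips the wasted lookups.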
Auth Code, Implicit, Client Credentials, Device Code, Resource Owner Password. Most engineers know the names. Few know which one fits which problem and why three of them are now considered insecure.