The DevOpsBeast Blog

Production engineering notes.

Field notes on Kubernetes, GPUs, Linux, and the rest of the production stack, from engineers who run real infrastructure.


All (42) · Container Internals (1) · Container Security (1) · DevOpsBeast (2) · GPU (1) · GPU Cost Optimization (1) · GPU Infrastructure (3) · Kubernetes (1) · Kubernetes Architecture (1) · Kubernetes Debugging (4) · Kubernetes Networking (2) · Kubernetes Operations (2) · Kubernetes Performance (1) · Kubernetes Security (2) · Linux (2) · LLM Infrastructure (7) · Networking (2) · Observability (1) · Security (8)

Container Security·Jul 16, 2026·6 min read

Mounting the Docker Socket Is Root on Your Host. Here Is Why.

A CI job mounts /var/run/docker.sock so it can build images. An attacker who compromises that job uses the socket to start a privileged container that mounts the host root filesystem. No exploit required. This is one of the most common real-world container escapes, and the fix is to stop needing the socket at all.

Container Internals·Jul 15, 2026·8 min read

There Is No Container: What Actually Runs When You docker run

There is no container object in the Linux kernel. A container is a normal process with four restrictions applied to it. Once you see the four, every container problem you have ever hit, an escape, an OOM kill, a permission error, image bloat, becomes a question about one specific piece.

Kubernetes Architecture·Jul 5, 2026·10 min read

"Walk Me Through What Happens When You Create a Pod." It's Also How You Debug One.

The canonical senior Kubernetes interview question has a twelve-step answer, from kubectl apply to a Ready pod. The same twelve steps are the map you walk backward every time a pod is stuck. Learn the chain once and you get the interview answer and the debugging flow for free.

Observability·Jun 23, 2026·10 min read

One Label Added Four Million Series to Your Prometheus. Here Is the Math.

A developer adds user_id to one counter to debug a support ticket. A week later Prometheus is eating 60 GB of RAM and queries time out. This is cardinality, the hidden cost center of metrics, and the math that predicts the disaster before it happens.
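A rough sketch of that math, under the assumption that every label combination actually occurs (a worst-case upper bound, not a measurement). The label names and cardinalities below are hypothetical, chosen only so the arithmetic lands on four million:

```python
from math import prod


def series_count(label_cardinalities):
    """Worst-case series for one metric: the product of the number of
    distinct values each label can take."""
    return prod(label_cardinalities.values())


# Hypothetical cardinalities for a single request counter.
before = {"method": 5, "status": 8, "endpoint": 20}
# The same counter after a user_id label with 5,000 distinct users is added.
after = {**before, "user_id": 5_000}

print(series_count(before))  # 800
print(series_count(after))   # 4000000
```

The existing labels multiply rather than add, which is why one extra label is enough to turn hundreds of series into millions.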

Kubernetes Security·Jun 14, 2026·12 min read

Your NetworkPolicy Controls the Front Door and Leaves the Building Through the Back.

Most Kubernetes NetworkPolicies restrict who can reach a service and stop there: ingress locked, egress wide open. But the blast radius of a compromised pod is almost entirely an egress story: lateral movement, the cloud metadata endpoint, data exfiltration, calling home. Default-deny in both directions, the DNS gotcha that breaks everything the moment you turn egress on, blocking 169.254.169.254 with an ipBlock except, the CNI that has to actually enforce it, and the audit-then-enforce rollout that doesn't take production down.
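To make the ipBlock except idea concrete, here is a small model of the matching rule in Python: a destination is allowed if it falls inside the cidr and outside every except range. This is only an illustration of the semantics, not a NetworkPolicy manifest, and the function name is hypothetical:

```python
import ipaddress


def ip_block_allows(dest, cidr, excepts=()):
    """Return True if dest is inside cidr and not inside any except range."""
    addr = ipaddress.ip_address(dest)
    if addr not in ipaddress.ip_network(cidr):
        return False
    return not any(addr in ipaddress.ip_network(e) for e in excepts)


# Allow egress everywhere except the cloud metadata endpoint.
rule = {"cidr": "0.0.0.0/0", "excepts": ["169.254.169.254/32"]}

print(ip_block_allows("169.254.169.254", **rule))  # False
print(ip_block_allows("93.184.216.34", **rule))    # True
```

Whether the rule is enforced at all still depends on the CNI, which is the point the post makes.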

LLM Infrastructure·Jun 10, 2026·13 min read

You Changed the Prompt. Is the Model Better or Worse? You Don't Have a Test That Tells You.

Operating an LLM in production is not MLOps and it is not traditional ops. It is running a non-deterministic component with no ground-truth notion of 'correct,' where a one-line prompt edit is a deploy and there is no green checkmark that says it's safe to ship. The operational surface that replaces the unit test: evals as your test suite, prompts and model versions as deployable config, RAG freshness, observability for systems that have no 'wrong answer' to alert on, and rollout you can't fully validate offline.

LLM Infrastructure·Jun 9, 2026·16 min read

kubectl drain Killed a 90-Second Inference Request. Stateless Drain Logic Doesn't Work for GPU Pods.

Draining a GPU node in the middle of a long inference request is how you teach your users what 503 looks like. A stateless pod evicts in seconds; a vLLM pod has a minute of cold start and requests in flight for two. The three things a production drain needs (a real grace period, a preStop that drains the engine, and a readiness gate that fails the instant drain starts), plus why the same pattern is load-bearing for spot preemption, autoscaler downscaling, and rolling upgrades.

Kubernetes Security·Jun 9, 2026·13 min read

A Pod in Your Cluster Just Got Compromised. Walk Me Through the Blast Radius.

One container gets popped: an RCE in an app, a malicious dependency, a leaked token. The junior answer is 'kill the pod.' The senior answer traces the blast radius: from the mounted ServiceAccount token to the API server, across a flat pod network to the cloud metadata endpoint, and through a privileged pod to the node and every secret on it. The attacker's path layer by layer, and the single control that caps the damage at each one: the difference between 'one pod' and 'whole cluster.'

DevOpsBeast·Jun 9, 2026·11 min read

Most Courses Teach Tools. Senior DevOps Interviews Test Architecture. Here's the Gap.

After 50+ senior DevOps interviews on both sides of the table, the same pattern keeps repeating: courses teach tools, interviews test architecture, and strong operators freeze the moment a question turns from 'what does this do' to 'design this and defend it.' The five reasoning questions senior candidates actually fail, what a knowledge answer looks like versus a reasoning answer, and how to close the gap.

GPU Cost Optimization·Jun 9, 2026·14 min read

Spot H100s Are 70% Cheaper. Most Teams Use Them Wrong and Pay More.

Spot GPUs are the single biggest cost lever you have, and the fastest way to turn a savings story into a reliability incident. The team that runs everything on spot eats a preemption, sees 503s, migrates back to on-demand, and triples the bill without ever asking whether the original setup was wrong. The real model: what a preemption actually costs, which workloads win on spot and which never should, the per-cloud warning windows, and the 70/30 baseline-plus-spot mix that cuts the bill 40 to 55% with no SLO hit, provided the drain logic is correct.

LLM Infrastructure·May 31, 2026·14 min read

Your GPU Finishes a Request and Waits for the Slowest. Continuous Batching Is the Fix.

Static batching pads every request to the length of the longest one in the batch. Short requests finish and their GPU slots sit idle, burning money, until the whole batch drains. Continuous batching schedules at the granularity of a single token instead of a whole request, and it is the single biggest reason vLLM is 5x faster than naive serving. Here is exactly how it works.

GPU Infrastructure·May 30, 2026·17 min read

Your GPU Dashboard Says 100% Utilized. It's Lying. Welcome to DCGM.

Every post about GPU incidents starts with 'the dashboards looked fine.' That's the problem. nvidia-smi GPU utilization tells you a kernel ran, not whether the silicon is doing work. The metrics that actually matter, the DCGM + Prometheus stack that exposes them, and the queries and alerts that catch real GPU failures.

LLM Infrastructure·May 30, 2026·16 min read

Your HPA Scales LLM Pods on CPU. They're Either Idle or On Fire.

The default Kubernetes autoscaler watches CPU. Your GPU sits at 100% no matter what. So your inference fleet either never scales or scales 90 seconds too late, after the cold start, after the SLO already broke. The signals that actually predict load, the KEDA wiring, and the cold-start tax that makes reactive scaling a trap.

LLM Infrastructure·May 28, 2026·16 min read

Your LLM Bill Tripled and Traffic Didn't. Welcome to Prompt Economics.

The unit of cost in an LLM system is the token, and almost nobody is counting them. Output tokens cost 3-5x input. Your context window is 80% dead weight. This is the cost-per-request math, where the tokens actually go, and the levers that bend the curve, in ROI order.
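A minimal sketch of cost-per-request math. The prices and token counts are made-up assumptions (an output price 4x the input price, inside the 3-5x range above), not any provider's real rates:

```python
def request_cost(input_tokens, output_tokens,
                 input_price_per_mtok, output_price_per_mtok):
    """Dollar cost of one request, with prices given per million tokens."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


# Hypothetical prices: $2 per 1M input tokens, $8 per 1M output tokens.
IN_PRICE, OUT_PRICE = 2.0, 8.0

# A request with a 4,000-token context and a 500-token answer.
full = request_cost(4_000, 500, IN_PRICE, OUT_PRICE)

# The same request after trimming 80% of the context as dead weight.
trimmed = request_cost(800, 500, IN_PRICE, OUT_PRICE)

print(f"full context:    ${full:.4f}")
print(f"trimmed context: ${trimmed:.4f}")
```

Multiply either figure by requests per day to see why a bill can triple with flat traffic: the token count per request moved, not the request count.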

GPU Infrastructure·May 25, 2026·14 min read

Your H100 Serves Three Teams Now. MIG or Time-Slicing? Pick Wrong and the Answer Hurts.

MIG is hardware partitioning. Time-slicing is software multiplexing. They are not interchangeable. The production decision walk-through, the H100 profile math, the GPU Operator config, and the migration path most teams hit.

LLM Infrastructure·May 20, 2026·13 min read

Your Team Is Debating vLLM vs SGLang. The Performance Numbers Are Not the Decision.

Both engines hit similar throughput on similar hardware in 2026. The decision is workload shape (agents vs chat vs RAG), structured output needs, and operational maturity. Here is the honest production comparison.

LLM Infrastructure·May 19, 2026·20 min read

Your LLM Cluster Is at 90% HBM and 60% Is KV Cache. Welcome to the Disaggregation Cliff.

vLLM prefix caching is great. It stops at one node. When your fleet of 50 H100s is bottlenecked on KV cache and adding GPUs is not financially viable, the next architecture is disaggregated KV cache. Here is the wall, the math, Mooncake, and what to actually do on Monday.

DevOpsBeast·May 19, 2026·8 min read

Why I Built DevOpsBeast (and Who It's Not For)

DevOpsBeast is not for everyone. It is built for one specific kind of engineer with one specific problem. Here is who it is for, who it is not for, and why I built it.

Security·May 15, 2026·12 min read

Your JWT Is Not a Session. The Costliest Misuse of OAuth in 2026.

JWTs were designed for short-lived authorization assertions. Half the industry uses them as session cookies, then discovers they cannot revoke. The five problems this causes and the right alternative.

Security·May 15, 2026·11 min read

MFA Fatigue Bypassed Uber, MGM, and Cisco. Number Matching Stops It in One Config Change.

MFA fatigue is the cheapest, most effective attack against push-based MFA in 2026. The defense is one IdP config change. Here is the attack, the defense, and why most companies still have not enabled it.

Security·May 15, 2026·11 min read

PKCE: Why Every OAuth Client Needs It in 2026 (Even the Ones That Used to Be Fine Without)

PKCE used to be a mobile-only thing. OAuth 2.1 makes it mandatory for everyone. Here is what the protection actually does, why a confidential web app needs it too, and the eight-line implementation that closes the authorization-code-interception attack.

Security·May 15, 2026·11 min read

How Auth0 Detects Stolen Refresh Tokens (and Why You Should Implement the Same)

Refresh-token rotation is a known good practice. The 'reuse detection' that goes with it is what actually catches stolen tokens. Here is how the mechanism works and how to implement it correctly.
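One way to picture reuse detection is an in-memory toy like the one below. It is a sketch of the general token-family idea, not Auth0's implementation; the class and method names are hypothetical, and a real system would persist state and store token hashes rather than raw tokens:

```python
import secrets


class RefreshTokenStore:
    """Toy rotation store: each login starts a token family, and only
    the most recently issued token in a family is valid."""

    def __init__(self):
        self._family_of = {}  # token -> family id (kept after rotation)
        self._current = {}    # family id -> the one currently valid token

    def issue(self) -> str:
        family = secrets.token_hex(8)
        token = secrets.token_urlsafe(32)
        self._family_of[token] = family
        self._current[family] = token
        return token

    def rotate(self, presented: str) -> str:
        family = self._family_of.get(presented)
        if family is None or family not in self._current:
            raise PermissionError("unknown or revoked refresh token")
        if self._current[family] != presented:
            # An already-rotated token came back: someone holds a stolen
            # copy. Revoke the whole family so both parties must log in again.
            del self._current[family]
            raise PermissionError("refresh token reuse detected")
        new_token = secrets.token_urlsafe(32)
        self._family_of[new_token] = family
        self._current[family] = new_token
        return new_token
```

Replaying an old token invalidates the newest one as well, which is what turns rotation from hygiene into detection.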

Security·May 14, 2026·11 min read

How GitHub Actions OIDC to AWS Actually Works (and the Eight Ways It Breaks)

AssumeRoleWithWebIdentity returns AccessDenied. The OIDC token looks valid. The trust policy looks right. The error message is useless. Eight specific causes, eight specific fixes, and a diagnostic that finds the right one in 30 seconds.

GPU Infrastructure·May 8, 2026·12 min read

Your 8B Model Won't Fit on an A100 With 50GB Free. Welcome to GPU Memory Fragmentation.

The model weights are 16GB. The KV cache is 20GB. The A100 has 80GB. nvidia-smi shows 50GB free. The next request OOMs. The CUDA memory allocator's fragmentation story most ML engineers never learn.

Kubernetes Debugging·May 8, 2026·13 min read

Your Liveness Probe Is Killing Your Pod Mid-Boot. The Six Probe Mistakes That Cause Real Outages.

Liveness probes that fire before your app is ready. Readiness probes that check the database. Exec probes leaking zombie processes by the thousands. The six mistakes that turn health checks into the cause of the outage they were supposed to prevent.

Kubernetes Operations·May 8, 2026·10 min read

kubectl drain Has Been Running for 4 Hours. Your PodDisruptionBudget Is Why.

Drains hang forever when a PodDisruptionBudget can never be satisfied. The four trap configurations, how to diagnose which one is biting, and the right PDB design that does not break node maintenance.

Networking·May 8, 2026·13 min read

Your Service Worked at 1,000 RPS. At 3,000 It Started Failing With 'Connection Refused'. Welcome to TIME_WAIT and conntrack.

Two unrelated kernel limits bite high-throughput Kubernetes services: ephemeral port exhaustion from TIME_WAIT and conntrack table overflow. Same symptom, different root causes, different fixes.

Kubernetes Networking·May 7, 2026·13 min read

Your Cluster Has 5,000 Services and kube-proxy Is the Bottleneck. Welcome to the iptables Cliff.

Every Service creation rewrites your entire iptables chain. At small scale you never notice. At 5,000 Services kube-proxy is at 100% CPU, Service updates take 30 seconds, and your latency p99 is in the seconds. Here is the cliff and how to avoid falling off it.

Kubernetes Debugging·May 7, 2026·13 min read

Your Critical Pod Got OOMKilled. The Pod That Caused It Is Still Running. Here Is Why.

A node runs out of memory. The kernel and the kubelet both pick which pod to kill. Neither of them picks the leaky one. They pick the well-behaved BestEffort pod next door. The QoS, oom_score_adj, and eviction-priority story most engineers never learn.

Security·May 6, 2026·11 min read

cert-manager Renewed Your Certificate. Your App Still Serves the Old One. Why?

The certificate in the Secret is fresh. The pod is still serving the expired one. cert-manager did its job. Your app did not. The five renewal failures that bite production.

Kubernetes Operations·May 6, 2026·9 min read

etcd Is Slowing Down Your Cluster: Compaction, Defrag, and the 2GB Wall

Your API server latency p99 is rising. etcd disk usage is creeping toward the 2GB quota. Compaction has run, defrag has not, and your cluster is one write spike away from a no-space-left-on-device outage.

Linux·May 6, 2026·11 min read

Too Many Open Files: The Linux Limit That Crashes Production at 3 AM

Your service runs fine for weeks, then suddenly fails with 'too many open files' under load. Three layers of fd limits, why the wrong one bites first, and how to set them so this stops happening.

Networking·May 6, 2026·10 min read

Your gRPC Connection Worked for an Hour, Then Stopped. Welcome to Keepalive Hell.

gRPC connections silently die behind load balancers, NAT gateways, and idle timeouts. The keepalive settings that prevent this are documented separately on every side and you need all four to agree.

Security·May 6, 2026·12 min read

Your JWT Validation Is Broken. Here Are the Eight Bugs That Caused Real Breaches.

JWTs look simple: a signed JSON blob, verify the signature, trust the claims. Almost every step of that has a known bug pattern that has caused real production breaches. Here is the catalog.

Kubernetes Debugging·May 5, 2026·13 min read

Your Kubernetes Cluster Just Died at 2 AM: The Certificate Nobody Was Watching

Kubernetes certificates expire silently. No warning, no alert, no graceful degradation, just a dead cluster. Here is how to fix it in five minutes and how to make sure it never happens again.

Kubernetes Performance·May 5, 2026·10 min read

Your Pod Is Using 5% CPU and Still Throttled. Here Is Why.

CPU limits in Kubernetes do not mean what you think they mean. A tour of CFS quota, the 100ms scheduling period, and why your latency spikes look nothing like CPU saturation.
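A back-of-the-envelope sketch of the quota arithmetic, assuming the 100ms period mentioned above and a hypothetical pod: a 1-core limit, 10 threads that each need 12.5ms of CPU in a burst, one burst every 2.5 seconds, and enough free cores on the node for all threads to run in parallel:

```python
def cfs_quota_ms(cpu_limit_cores, period_ms=100.0):
    """CPU time the cgroup may consume per period, summed over all threads."""
    return cpu_limit_cores * period_ms


def deferred_ms(cpu_demand_ms, cpu_limit_cores, period_ms=100.0):
    """CPU time that does not fit in the period and waits for the next one."""
    return max(0.0, cpu_demand_ms - cfs_quota_ms(cpu_limit_cores, period_ms))


# Hypothetical workload (all numbers are assumptions for illustration).
LIMIT_CORES = 1.0
THREADS = 10
CPU_PER_THREAD_MS = 12.5
BURST_INTERVAL_MS = 2_500.0

quota = cfs_quota_ms(LIMIT_CORES)                 # 100.0 ms per period
burst = THREADS * CPU_PER_THREAD_MS               # 125.0 ms of CPU demand
wall_to_exhaust = quota / THREADS                 # quota gone after 10.0 ms
throttled = deferred_ms(burst, LIMIT_CORES)       # 25.0 ms pushed to next period
avg_util = burst / BURST_INTERVAL_MS / LIMIT_CORES  # 0.05 of the limit

print(f"quota {quota} ms, exhausted after {wall_to_exhaust} ms of wall time")
print(f"{throttled} ms deferred, average utilization {avg_util:.0%}")
```

Ten parallel threads burn the whole quota in about 10ms of wall time, then sit throttled for the rest of the period, while the averaged graph shows 5% of the limit.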

Kubernetes Networking·May 5, 2026·9 min read

Why Every Kubernetes Cluster Makes 5 DNS Queries For One Lookup

ndots:5 is the silent latency killer in Kubernetes. Every external hostname resolution generates four wasted queries before the right one. Here is why, and how to fix it.
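The query expansion can be sketched as a few lines of Python that mimic the resolver's search-list ordering. The search list below is an assumption: the three cluster domains for a pod in the default namespace plus one hypothetical domain inherited from the node (example.internal). Real pods may carry more or fewer entries:

```python
def resolver_candidates(name, search, ndots=5):
    """Names a stub resolver tries, in order, for one lookup."""
    if name.endswith("."):
        return [name]  # already fully qualified: no search list applied
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + expanded  # enough dots: try the name as-is first
    return expanded + [absolute]      # too few dots: search list goes first


# Assumed search list; the last entry is a hypothetical node-level domain.
SEARCH = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "example.internal",
]

for candidate in resolver_candidates("api.github.com", SEARCH):
    print(candidate)
```

With this list, api.github.com has two dots, fewer than five, so four search-domain names are tried before the real one. A trailing dot (api.github.com.) skips the search list entirely.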

Security·May 5, 2026·13 min read

OAuth 2.0 Flows in Production: Which One You Should Actually Use

Auth Code, Implicit, Client Credentials, Device Code, Resource Owner Password. Most engineers know the names. Few know which one fits which problem and why three of them are now considered insecure.

Kubernetes Debugging·Apr 26, 2026·12 min read

How to Debug Kubernetes OOMKilled (Exit Code 137): The Complete Guide

Three completely different problems hide behind exit code 137. Most engineers fix the wrong one and the pod keeps crashing.

Linux·Apr 25, 2026·8 min read

cgroups, Pod Memory Limits, and What Actually Gets Counted

Your pod's memory limit isn't measuring what you think it is. A tour of cgroup v2 accounting and the surprises hiding inside memory.current.

GPU·Apr 24, 2026·9 min read

Tuning vLLM gpu_memory_utilization Without Breaking Production

The default 0.9 is wrong for almost every production deployment. Here's how to pick the right number for your model, GPU, and traffic shape.

Kubernetes·Apr 23, 2026·9 min read

The Kubernetes Upgrade Preflight Checklist

Every Kubernetes upgrade I've watched fail in production failed for a reason that was visible an hour earlier. Here's the checklist.