Production LLM Inference on Kubernetes
Deep production knowledge for engineers running LLM inference on self-managed Kubernetes. vLLM optimization, gateway architecture, observability, debugging, and cost modeling, all from real H100 production deployments. Lifetime updates included.
Curriculum
5 modules · 18 lessons
The Production Inference Stack
The two-layer architecture that every production LLM stack converges on, the lifecycle of a single request, and the metrics that actually predict production health.
Single-GPU Optimization with vLLM
vLLM configuration, prefill vs decode dynamics, KV cache management, and quantization: everything you need to get maximum throughput out of one GPU.
Gateway Architecture on Kubernetes
Why you need a gateway, streaming APIs done right, routing and failover patterns, and multi-tenant gateway design on Kubernetes.
Observability & Debugging
How to actually see what your inference stack is doing: separating gateway and engine signals, debugging tail latency, chasing throughput regressions, and a playbook for the 3am pages.
Cost Modeling at Scale
The math that turns 'feels expensive' into a real budget. Cost per token, concurrency economics, and a playbook of optimizations that actually cut the bill.
About the Author

Sharon Sahadevan
AI Infrastructure Engineer
Building production GPU clusters on Kubernetes. H100s, large-scale model serving, and end-to-end ML infrastructure across Azure and AWS.
10+ years designing cloud-native platforms with deep expertise in Kubernetes orchestration, GitOps (Argo CD), Terraform, and MLOps pipelines for LLM deployment.
Author of KubeNatives, a weekly newsletter read by 3,000+ DevOps and ML engineers for production insights on K8s internals, GPU scheduling, and model-serving patterns.