Advanced · 9 hours · 18 lessons

Production LLM Inference on Kubernetes

Deep production knowledge for engineers running LLM inference on self-managed Kubernetes. Covers vLLM optimization, gateway architecture, observability, debugging, and cost modeling, all drawn from real H100 production deployments. Lifetime updates included.

Text-based, no videos
5 modules, 18 lessons
Lifetime access

What you'll learn

The gateway vs engine two-layer architecture that every production inference stack needs
vLLM configuration: KV cache math, prefill vs decode, quantization trade-offs
Gateway architecture on Kubernetes: streaming, routing, multi-tenancy, failover
Metrics that actually matter: separating gateway and engine signals
Tail latency debugging: where the 5% of requests that are 10x slower actually come from
Throughput degradation: why your p99 drifts up over days and how to fix it
The production debugging playbook for LLM inference incidents
True cost per token: the math that turns rough guesses into real budgets
Concurrent traffic economics: why doubling QPS doesn't double cost
A cost optimization playbook that actually ships savings, not slideware

Curriculum

01

The Production Inference Stack

The two-layer architecture that every production LLM stack converges on, the lifecycle of a single request, and the metrics that actually predict production health.

3 lessons
02

Single-GPU Optimization with vLLM

vLLM configuration, prefill vs decode dynamics, KV cache management, and quantization: everything you need to get maximum throughput out of one GPU.

4 lessons
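
To give a flavor of the KV cache math this module refers to, here is a minimal sketch that is not taken from the course. It uses the standard per-token estimate (one key and one value vector per layer per KV head). The function names, the model shape, and the 40 GiB cache budget are illustrative assumptions, not figures from the lessons.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache that a single token occupies across all layers."""
    # Factor of 2: one key vector and one value vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(cache_budget_bytes, bytes_per_token):
    """Whole tokens that fit in a given KV cache budget."""
    return cache_budget_bytes // bytes_per_token


# Assumed shape: 32 layers, 8 KV heads (grouped-query attention), head_dim 128, fp16.
per_token = kv_cache_bytes_per_token(32, 8, 128, 2)
# Assumed budget: 40 GiB left for KV cache after weights and activations.
capacity = max_cached_tokens(40 * 1024**3, per_token)
print(per_token, capacity)  # 131072 327680
```

Under these assumptions each token costs 128 KiB of cache, so the budget holds roughly 328k tokens shared across all in-flight requests. Real capacity depends on the engine's block allocation and memory settings.
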
03

Gateway Architecture on Kubernetes

Why you need a gateway, streaming APIs done right, routing and failover patterns, and multi-tenant gateway design on Kubernetes.

4 lessons
04

Observability & Debugging

How to actually see what your inference stack is doing: separating gateway and engine signals, debugging tail latency, chasing throughput regressions, and a playbook for the 3am pages.

4 lessons
05

Cost Modeling at Scale

The math that turns 'feels expensive' into a real budget. Cost per token, concurrency economics, and a playbook of optimizations that actually cut the bill.

3 lessons
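
As a rough illustration of what cost-per-token math can look like, the sketch below divides an hourly GPU price by the tokens actually served in that hour. It is not the course's model. The function name, the $4.00/hour price, the 2,500 tokens/s throughput, and the utilization factor are all hypothetical values chosen for the example.

```python
def cost_per_million_tokens(gpu_hourly_usd, tokens_per_second, utilization=1.0):
    """USD per one million generated tokens on a single GPU.

    utilization is the fraction of each hour the GPU spends serving
    traffic at the given throughput (an assumed simplification).
    """
    if tokens_per_second <= 0 or not 0 < utilization <= 1:
        raise ValueError("throughput must be positive and utilization in (0, 1]")
    effective_tokens_per_hour = tokens_per_second * utilization * 3600
    return gpu_hourly_usd / effective_tokens_per_hour * 1_000_000


# Hypothetical inputs: $4.00 per GPU-hour, 2,500 tokens/s aggregate throughput.
print(round(cost_per_million_tokens(4.0, 2500.0), 2))        # 0.44
# Same GPU, but busy only half the time: the per-token cost doubles.
print(round(cost_per_million_tokens(4.0, 2500.0, 0.5), 2))   # 0.89
```

The second call shows why idle capacity matters as much as raw throughput in this simplified model: the hourly price is fixed, so every idle minute is spread over fewer tokens.
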

About the Author

Sharon Sahadevan


AI Infrastructure Engineer

Builds production GPU clusters on Kubernetes: H100s, large-scale model serving, and end-to-end ML infrastructure across Azure and AWS.

10+ years designing cloud-native platforms with deep expertise in Kubernetes orchestration, GitOps (Argo CD), Terraform, and MLOps pipelines for LLM deployment.

Author of KubeNatives, a weekly newsletter read by 3,000+ DevOps and ML engineers for production insights on K8s internals, GPU scheduling, and model-serving patterns.

Ready to master this topic?

Start with the free preview lesson and see for yourself.