Advanced · 9 hours · 18 lessons

Production LLM Inference on Kubernetes

Deep production knowledge for engineers running LLM inference on self-managed Kubernetes. vLLM optimization, gateway architecture, observability, debugging, and cost modeling — all from real H100 production deployments. Phase 1 launches with 5 modules. Phase 2 (3 additional modules covering engine comparisons, multi-GPU parallelism, and multi-node scaling) ships within 90 days. Lifetime updates included.

Early Bird Pricing
$59 (regularly $79) · Save $20

One-time payment. Lifetime access.

Text-based, no videos
5 modules, 18 lessons
Lifetime access

What you'll learn

The gateway vs engine two-layer architecture that every production inference stack needs
vLLM configuration — KV cache math (sketched just after this list), prefill vs decode, quantization trade-offs
Gateway architecture on Kubernetes — streaming, routing, multi-tenancy, failover
Metrics that actually matter — separating gateway and engine signals
Tail latency debugging — where the 5% of requests that are 10x slower actually come from
Throughput degradation — why your p99 drifts up over days and how to fix it
The production debugging playbook for LLM inference incidents
True cost per token — the math that turns rough guesses into real budgets
Concurrent traffic economics — why doubling QPS doesn't double cost
A cost optimization playbook that actually ships savings, not slideware
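
A taste of the KV cache math mentioned above, as a back-of-envelope sketch. The model shape is a Llama-3-8B-style assumption (32 layers, 8 KV heads under grouped-query attention), not material lifted from the course:

    # Back-of-envelope KV cache sizing for a decoder-only transformer.
    # All numbers are illustrative assumptions.

    def kv_cache_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
        """Bytes of KV cache one token occupies: K and V, every layer."""
        return 2 * layers * kv_heads * head_dim * dtype_bytes

    per_token = kv_cache_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
    print(per_token)  # 131072 bytes = 128 KiB per token in fp16

    # How many tokens fit in the HBM left over after weights?
    hbm_bytes = 80e9         # H100 80 GB (decimal GB, an approximation)
    weights_bytes = 16e9     # ~8B params in fp16
    budget = 0.9 * hbm_bytes - weights_bytes  # vLLM-style utilization cap
    print(int(budget // per_token))  # ~427k tokens of KV cache capacity

Shrink the max sequence length or quantize the weights and that token budget moves, which is exactly the trade-off the single-GPU module works through.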

Curriculum

5 modules · 18 lessons
01

The Production Inference Stack

The two-layer architecture that every production LLM stack converges on, the lifecycle of a single request, and the metrics that actually predict production health.

3 lessons
02

Single-GPU Optimization with vLLM

vLLM configuration, prefill vs decode dynamics, KV cache management, and quantization — everything you need to get maximum throughput out of one GPU. A minimal configuration sketch follows below.

4 lessons
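
For a flavor of that configuration surface, here is a minimal single-GPU sketch using vLLM's offline Python API. The parameter names match recent vLLM releases (check your version's docs); the values are illustrative starting points, not tuned recommendations:

    # Minimal single-GPU vLLM setup. Values are starting points only.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        gpu_memory_utilization=0.90,  # fraction of HBM vLLM may claim
        max_model_len=8192,           # caps KV cache cost per sequence
        max_num_seqs=256,             # continuous-batching concurrency limit
    )

    out = llm.generate(["Why is prefill compute-bound?"],
                       SamplingParams(max_tokens=64, temperature=0.0))
    print(out[0].outputs[0].text)
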
03

Gateway Architecture on Kubernetes

Why you need a gateway, streaming APIs done right, routing and failover patterns, and multi-tenant gateway design on Kubernetes. A minimal streaming sketch follows below.

4 lessons
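
For a flavor of the streaming problem, here is a minimal pass-through sketch: a FastAPI gateway in front of a vLLM OpenAI-compatible server. The Service URL is hypothetical, and a real gateway adds auth, routing, and retries on top:

    # Minimal token-streaming pass-through; illustrative only.
    import httpx
    from fastapi import FastAPI, Request
    from fastapi.responses import StreamingResponse

    app = FastAPI()
    ENGINE_URL = "http://vllm.default.svc:8000/v1/chat/completions"  # assumed

    @app.post("/v1/chat/completions")
    async def proxy(request: Request):
        payload = await request.json()

        async def relay():
            # Forward Server-Sent Events chunk by chunk, without
            # buffering, so time-to-first-token stays low.
            async with httpx.AsyncClient(timeout=None) as client:
                async with client.stream("POST", ENGINE_URL, json=payload) as resp:
                    async for chunk in resp.aiter_bytes():
                        yield chunk

        return StreamingResponse(relay(), media_type="text/event-stream")
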
04

Observability & Debugging

How to actually see what your inference stack is doing — separating gateway and engine signals, debugging tail latency, chasing throughput regressions, and a playbook for the 3am pages. A small signal-separation sketch follows below.

4 lessons
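
A toy illustration of the signal separation, with invented timings and hypothetical field names. Subtract engine time from gateway time and the outlier explains itself:

    # Gateway overhead = gateway-observed latency minus engine latency.
    # All timings below are invented sample data.
    import statistics

    requests = [
        # (gateway_total_ms, engine_total_ms) per request
        (812, 790), (1450, 1401), (9050, 1380), (760, 741),
    ]

    overheads = [g - e for g, e in requests]
    print("median gateway overhead:", statistics.median(overheads), "ms")
    print("worst gateway overhead:", max(overheads), "ms")
    # The 9050 ms request spent ~7.7 s outside the engine: queueing,
    # connection handling, or retries in the gateway, not model compute.
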
05

Cost Modeling at Scale

The math that turns 'feels expensive' into a real budget. Cost per token, concurrency economics, and a playbook of optimizations that actually cut the bill. A worked cost sketch follows below.

3 lessons
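
A back-of-envelope version of the cost math, with an assumed GPU rate and made-up throughput numbers (not benchmarks from the course):

    # Cost per million output tokens at a fixed hourly GPU rate.
    gpu_hourly_usd = 4.00        # assumed H100 on-demand rate
    tok_per_sec_single = 60      # one request decoding alone
    tok_per_sec_batched = 2400   # ~64 concurrent requests, continuous batching

    def usd_per_million_tokens(rate_usd_hr, tokens_per_sec):
        return rate_usd_hr / (tokens_per_sec * 3600) * 1e6

    print(usd_per_million_tokens(gpu_hourly_usd, tok_per_sec_single))   # ~$18.52
    print(usd_per_million_tokens(gpu_hourly_usd, tok_per_sec_batched))  # ~$0.46
    # Same GPU, same hourly bill: batching spreads a fixed cost across
    # many requests, which is why doubling QPS doesn't double cost.
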

Coming in Phase 2

Phase 2 ships within 90 days — lifetime updates included. Buy once, get every module as it ships.

06

Inference Engine Comparison

vLLM vs SGLang vs TensorRT-LLM — benchmarks, when to pick each, and the migration story when you outgrow one.

PHASE 2
07

Multi-GPU Parallelism

Tensor parallelism, pipeline parallelism, expert parallelism — what each is for, how they interact, and how to pick a strategy for your model size and traffic shape. A quick fit-check sketch follows below.

PHASE 2
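
A rough fit-check sketch for tensor parallelism, with illustrative numbers; the overhead term is a loose assumption covering activations, CUDA context, and KV cache headroom:

    # Does an N-billion-parameter model fit at a given TP degree?
    # Weights shard roughly linearly across GPUs under tensor parallelism.
    def fits(params_b, dtype_bytes, tp_size, hbm_gb=80, overhead_gb=12):
        per_gpu_weights_gb = params_b * dtype_bytes / tp_size
        return per_gpu_weights_gb + overhead_gb <= hbm_gb

    print(fits(params_b=70, dtype_bytes=2, tp_size=2))  # False: 70 GB/GPU + headroom
    print(fits(params_b=70, dtype_bytes=2, tp_size=4))  # True: 35 GB/GPU of weights

Pipeline and expert parallelism change this arithmetic, and tensor parallelism only pays off over a fast interconnect, which is where the multi-node module picks up.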
08

Multi-Node Scaling

Ray clusters, interconnect choices (NVLink vs InfiniBand vs Ethernet RDMA), and the orchestrators that make multi-node inference actually work in production.

PHASE 2

About the Author

Sharon Sahadevan

AI Infrastructure Engineer

Building production GPU clusters on Kubernetes — H100s, large-scale model serving, and end-to-end ML infrastructure across Azure and AWS.

10+ years designing cloud-native platforms with deep expertise in Kubernetes orchestration, GitOps (Argo CD), Terraform, and MLOps pipelines for LLM deployment.

Author of KubeNatives, a weekly newsletter read by 3,000+ DevOps and ML engineers for production insights on K8s internals, GPU scheduling, and model-serving patterns.

Ready to master this topic?

Start with the free preview lesson and see for yourself.