Advanced · 9 hours · 18 lessons

Production LLM Inference on Kubernetes

Deep production knowledge for engineers running LLM inference on self-managed Kubernetes. vLLM optimization, gateway architecture, observability, debugging, and cost modeling — all from real H100 production deployments. Phase 1 launches with 5 modules. Phase 2 (3 additional modules covering engine comparisons, multi-GPU parallelism, and multi-node scaling) ships within 90 days. Lifetime updates included.

Early Bird Pricing
$59 (regularly $79) · Save $20

One-time payment. Lifetime access.

Text-based, no videos
5 modules, 18 lessons
Lifetime access

What you'll learn

The gateway vs engine two-layer architecture that every production inference stack needs
vLLM configuration — KV cache math (sketched just after this list), prefill vs decode, quantization trade-offs
Gateway architecture on Kubernetes — streaming, routing, multi-tenancy, failover
Metrics that actually matter — separating gateway and engine signals
Tail latency debugging — where the 5% of requests that are 10x slower actually come from
Throughput degradation — why your p99 drifts up over days and how to fix it
The production debugging playbook for LLM inference incidents
True cost per token — the math that turns rough guesses into real budgets
Concurrent traffic economics — why doubling QPS doesn't double cost
A cost optimization playbook that actually ships savings, not slideware
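
A taste of the KV cache math mentioned above, as a back-of-envelope sketch. The model shape is a Llama-3-8B-style assumption (32 layers, 8 KV heads under grouped-query attention), not material lifted from the course:

    # Back-of-envelope KV cache sizing for a decoder-only transformer.
    # All numbers are illustrative assumptions.

    def kv_cache_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
        """Bytes of KV cache one token occupies: K and V, every layer."""
        return 2 * layers * kv_heads * head_dim * dtype_bytes

    per_token = kv_cache_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
    print(per_token)  # 131072 bytes = 128 KiB per token in fp16

    # How many tokens fit in the HBM left over after weights?
    hbm_bytes = 80e9         # H100 80 GB (decimal GB, an approximation)
    weights_bytes = 16e9     # ~8B params in fp16
    budget = 0.9 * hbm_bytes - weights_bytes  # vLLM-style utilization cap
    print(int(budget // per_token))  # ~427k tokens of KV cache capacity

Shrink the max sequence length or quantize the weights and that token budget moves, which is exactly the trade-off the single-GPU module works through.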

Curriculum

5 modules · 18 lessons
01

The Production Inference Stack

The two-layer architecture that every production LLM stack converges on, the lifecycle of a single request, and the metrics that actually predict production health.

3 lessons
02

Single-GPU Optimization with vLLM

vLLM configuration, prefill vs decode dynamics, KV cache management, and quantization — everything you need to get maximum throughput out of one GPU. A minimal configuration sketch follows below.

4 lessons
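
For a flavor of that configuration surface, here is a minimal single-GPU sketch using vLLM's offline Python API. The parameter names match recent vLLM releases (check your version's docs); the values are illustrative starting points, not tuned recommendations:

    # Minimal single-GPU vLLM setup. Values are starting points only.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        gpu_memory_utilization=0.90,  # fraction of HBM vLLM may claim
        max_model_len=8192,           # caps KV cache cost per sequence
        max_num_seqs=256,             # continuous-batching concurrency limit
    )

    out = llm.generate(["Why is prefill compute-bound?"],
                       SamplingParams(max_tokens=64, temperature=0.0))
    print(out[0].outputs[0].text)
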
03

Gateway Architecture on Kubernetes

Why you need a gateway, streaming APIs done right, routing and failover patterns, and multi-tenant gateway design on Kubernetes. A minimal streaming sketch follows below.

4 lessons
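
For a flavor of the streaming problem, here is a minimal pass-through sketch: a FastAPI gateway in front of a vLLM OpenAI-compatible server. The Service URL is hypothetical, and a real gateway adds auth, routing, and retries on top:

    # Minimal token-streaming pass-through; illustrative only.
    import httpx
    from fastapi import FastAPI, Request
    from fastapi.responses import StreamingResponse

    app = FastAPI()
    ENGINE_URL = "http://vllm.default.svc:8000/v1/chat/completions"  # assumed

    @app.post("/v1/chat/completions")
    async def proxy(request: Request):
        payload = await request.json()

        async def relay():
            # Forward Server-Sent Events chunk by chunk, without
            # buffering, so time-to-first-token stays low.
            async with httpx.AsyncClient(timeout=None) as client:
                async with client.stream("POST", ENGINE_URL, json=payload) as resp:
                    async for chunk in resp.aiter_bytes():
                        yield chunk

        return StreamingResponse(relay(), media_type="text/event-stream")
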
04

Observability & Debugging

How to actually see what your inference stack is doing — separating gateway and engine signals, debugging tail latency, chasing throughput regressions, and a playbook for the 3am pages. A small signal-separation sketch follows below.

4 lessons
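
A toy illustration of the signal separation, with invented timings and hypothetical field names. Subtract engine time from gateway time and the outlier explains itself:

    # Gateway overhead = gateway-observed latency minus engine latency.
    # All timings below are invented sample data.
    import statistics

    requests = [
        # (gateway_total_ms, engine_total_ms) per request
        (812, 790), (1450, 1401), (9050, 1380), (760, 741),
    ]

    overheads = [g - e for g, e in requests]
    print("median gateway overhead:", statistics.median(overheads), "ms")
    print("worst gateway overhead:", max(overheads), "ms")
    # The 9050 ms request spent ~7.7 s outside the engine: queueing,
    # connection handling, or retries in the gateway, not model compute.
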
05

Cost Modeling at Scale

The math that turns 'feels expensive' into a real budget. Cost per token, concurrency economics, and a playbook of optimizations that actually cut the bill. A worked cost sketch follows below.

3 lessons
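
A back-of-envelope version of the cost math, with an assumed GPU rate and made-up throughput numbers (not benchmarks from the course):

    # Cost per million output tokens at a fixed hourly GPU rate.
    gpu_hourly_usd = 4.00        # assumed H100 on-demand rate
    tok_per_sec_single = 60      # one request decoding alone
    tok_per_sec_batched = 2400   # ~64 concurrent requests, continuous batching

    def usd_per_million_tokens(rate_usd_hr, tokens_per_sec):
        return rate_usd_hr / (tokens_per_sec * 3600) * 1e6

    print(usd_per_million_tokens(gpu_hourly_usd, tok_per_sec_single))   # ~$18.52
    print(usd_per_million_tokens(gpu_hourly_usd, tok_per_sec_batched))  # ~$0.46
    # Same GPU, same hourly bill: batching spreads a fixed cost across
    # many requests, which is why doubling QPS doesn't double cost.
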

Coming in Phase 2

Phase 2 ships within 90 days — lifetime updates included. Buy once, get every module as it ships.

06

Inference Engine Comparison

vLLM vs SGLang vs TensorRT-LLM — benchmarks, when to pick each, and the migration story when you outgrow one.

PHASE 2
07

Multi-GPU Parallelism

Tensor parallelism, pipeline parallelism, expert parallelism — what each is for, how they interact, and how to pick a strategy for your model size and traffic shape. A quick fit-check sketch follows below.

PHASE 2
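
A rough fit-check sketch for tensor parallelism, with illustrative numbers; the overhead term is a loose assumption covering activations, CUDA context, and KV cache headroom:

    # Does an N-billion-parameter model fit at a given TP degree?
    # Weights shard roughly linearly across GPUs under tensor parallelism.
    def fits(params_b, dtype_bytes, tp_size, hbm_gb=80, overhead_gb=12):
        per_gpu_weights_gb = params_b * dtype_bytes / tp_size
        return per_gpu_weights_gb + overhead_gb <= hbm_gb

    print(fits(params_b=70, dtype_bytes=2, tp_size=2))  # False: 70 GB/GPU + headroom
    print(fits(params_b=70, dtype_bytes=2, tp_size=4))  # True: 35 GB/GPU of weights

Pipeline and expert parallelism change this arithmetic, and tensor parallelism only pays off over a fast interconnect, which is where the multi-node module picks up.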
08

Multi-Node Scaling

Ray clusters, interconnect choices (NVLink vs InfiniBand vs Ethernet RDMA), and the orchestrators that make multi-node inference actually work in production.

PHASE 2

About the Author

Sharon Sahadevan

AI Infrastructure Engineer

Building production GPU clusters on Kubernetes — H100s, large-scale model serving, and end-to-end ML infrastructure across Azure and AWS.

10+ years designing cloud-native platforms with deep expertise in Kubernetes orchestration, GitOps (Argo CD), Terraform, and MLOps pipelines for LLM deployment.

Author of KubeNatives, a weekly newsletter read by 3,000+ DevOps and ML engineers for production insights on K8s internals, GPU scheduling, and model-serving patterns.

Ready to master this topic?

Start with the free preview lesson and see for yourself.