Advanced · 18 hours · 36 lessons

Kubernetes Architecture & Chaos

How Kubernetes actually works under the hood, from API server request lifecycle to etcd Raft to the scheduler framework, paired with chaos engineering reasoning that turns architectural knowledge into operational confidence. Built for the interview question "walk me through what happens when you create a pod" and the production question "how do we test resilience without breaking customers?"

Text-based, no videos

12 modules, 36 lessons

Lifetime access

What you'll learn

The control-plane / data-plane split and why every Kubernetes component fits into one or the other

API server internals: request lifecycle, admission controllers, performance under load

etcd architecture: Raft consensus, watch streams, the failure modes that take clusters down

The scheduler framework: predicates, priorities, profiles, and scaling past 10,000 pods

How controllers and operators reconcile state, and what the controller-runtime library does for you

The kubelet's pod lifecycle, from API watch to running container via CRI

Networking internals: kube-proxy, CNI dataplane (Calico, Cilium), DNS and service discovery at scale

Mapping your cluster's failure domains: what survives an apiserver outage, an etcd loss, or the loss of the entire control plane

Chaos engineering fundamentals: hypothesis-driven experiments, blast-radius scoping, game days

Pod, node, and cluster-level chaos for Kubernetes: kill, network delay, partition, control plane failure

Tools in practice: Chaos Mesh, Litmus, Chaos Monkey for K8s, and CI/CD-integrated chaos

Interview-ready architecture reasoning: "walk me through pod creation", "design a self-healing cluster", "test resilience without breaking production"

Curriculum

12 modules · 36 lessons

The Kubernetes Architecture Mental Model

The clean separation of control plane and data plane, the API server as universal bus, and reconciliation loops as the universal pattern.

3 lessons

The Control Plane and Data Plane Split · 30 min · FREE
The API Server as the Universal Bus · 30 min · FREE
Reconciliation Loops Everywhere · 30 min · FREE
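To make the reconciliation-loop idea from this module concrete, here is a minimal sketch in Python. It is a toy model, not course material and not a real Kubernetes client: the `Cluster` class, `reconcile`, and `run_until_converged` are hypothetical names invented for illustration. Each pass compares desired state with observed state and takes one small corrective step.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy stand-in for cluster state (hypothetical, not a Kubernetes API)."""
    desired_replicas: int
    running: list[str] = field(default_factory=list)
    next_id: int = 0


def reconcile(cluster: Cluster) -> bool:
    """Take one step toward the desired state; return True once converged."""
    diff = cluster.desired_replicas - len(cluster.running)
    if diff > 0:
        # Too few pods: create one, then let the next pass re-check.
        cluster.running.append(f"pod-{cluster.next_id}")
        cluster.next_id += 1
    elif diff < 0:
        # Too many pods: remove one.
        cluster.running.pop()
    return len(cluster.running) == cluster.desired_replicas


def run_until_converged(cluster: Cluster, max_passes: int = 100) -> int:
    """Call reconcile repeatedly; return the number of passes it took."""
    for passes in range(1, max_passes + 1):
        if reconcile(cluster):
            return passes
    raise RuntimeError("did not converge")
```

The loop never records what it did last time; it only looks at the gap between desired and observed state. That is why the same function repairs the cluster after a pod disappears, which is the self-healing behavior the module title refers to.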

API Server Internals

Request lifecycle, admission controller pipeline, and the performance characteristics that determine how the apiserver scales.

3 lessons

Request Lifecycle Through the API Server · 30 min
Admission Controllers (Validating, Mutating) · 30 min
API Server Performance and Scaling · 30 min

etcd Architecture

Raft consensus mechanics, the watch stream model, and the failure modes that take clusters offline.

3 lessons

Raft Consensus in Practice · 30 min
Watch Streams and How They Scale · 30 min
etcd Failure Modes · 30 min
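The arithmetic behind Raft quorum is short enough to sketch. The example below is an illustration, not course material; the function names are hypothetical. It relies only on the standard Raft rule that a write needs a majority of members to commit.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given member count."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority remains."""
    return members - quorum(members)


for n in (1, 3, 4, 5):
    print(f"{n} members: quorum={quorum(n)}, tolerates={tolerated_failures(n)}")
```

A four-member cluster tolerates the same single failure as a three-member cluster, which is why etcd clusters are usually sized with an odd number of members.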

The Scheduler

The scheduling framework, the predicate/priority pipeline (implemented as filter and score plugins in the current framework), and the design choices that make the scheduler work past 10,000 pods.

3 lessons

Scheduling Framework · 30 min
Predicates, Priorities, Profiles · 30 min
Scheduler at Scale (10,000+ pods) · 30 min

Controllers and Operators

The controller pattern that makes Kubernetes a platform, the built-in controllers that run your workloads, and the operator pattern for everything else.

3 lessons

The Controller Pattern · 30 min
Built-in Controllers Tour · 30 min
Operators and CRDs · 30 min

kubelet Deep Dive

The pod lifecycle from kubelet's perspective, the CRI shim model, and the kubelet failure modes that take nodes offline.

3 lessons

The Pod Lifecycle on a Node · 30 min
Container Runtime Interface (CRI) · 30 min
kubelet Failure Modes · 30 min

Networking Internals

How Services, CNI, and DNS actually work at the iptables/eBPF layer, with the scaling characteristics that matter at thousands of services.

3 lessons

kube-proxy and Service Implementation · 30 min
CNI Deep Dive (Calico, Cilium internals) · 30 min
DNS and Service Discovery at Scale · 30 min

Failure Domains

What survives which kind of failure. The bounded-impact reasoning that turns architecture knowledge into resilience design.

3 lessons

Mapping Your Cluster's Failure Domains · 30 min
What Survives a Control Plane Failure · 30 min
What Survives an etcd Loss · 30 min

Chaos Engineering Fundamentals

The discipline of breaking things on purpose, why it works, and the hypothesis-driven approach that separates chaos engineering from random pod-killing.

3 lessons

Why You Need to Break Things on Purpose · 30 min
Hypothesis-Driven Chaos · 30 min
Blast Radius and Game Days · 30 min
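The difference between hypothesis-driven chaos and random pod-killing can be shown as a small experiment skeleton. This is a sketch under stated assumptions, not the course's method and not the API of any chaos tool: `run_experiment` and its three callbacks are hypothetical names. The caller supplies a steady-state check, a fault to inject, and a rollback.

```python
from typing import Callable


def run_experiment(
    steady_state: Callable[[], bool],
    inject: Callable[[], None],
    rollback: Callable[[], None],
) -> str:
    """Run one chaos experiment and report whether the hypothesis held."""
    # Never inject a fault into a system that is already unhealthy.
    if not steady_state():
        return "aborted"
    inject()
    try:
        # The hypothesis: steady state still holds while the fault is active.
        held = steady_state()
    finally:
        # Always undo the fault, even if the check itself raises.
        rollback()
    return "hypothesis held" if held else "hypothesis refuted"


# Toy usage: a service with 3 replicas that needs at least 2 to serve traffic.
replicas = {"up": 3}
result = run_experiment(
    steady_state=lambda: replicas["up"] >= 2,
    inject=lambda: replicas.update(up=replicas["up"] - 1),
    rollback=lambda: replicas.update(up=3),
)
print(result)
```

Checking steady state before injecting and rolling back unconditionally are the two pieces that keep the blast radius bounded; a refuted hypothesis is a finding, not a failed run.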

Chaos Engineering for Kubernetes

The concrete experiments at the pod, node, and cluster level: what each one tests, what bugs it surfaces, and how to scope it.

3 lessons

Pod Chaos (kill, network delay, CPU pressure) · 30 min
Node Chaos (drain, terminate, network partition) · 30 min
Cluster Chaos (control plane failure, etcd partition) · 30 min

Tools in Practice

The tooling landscape, the game day playbook, and the CI/CD integration that turns chaos from quarterly drill into continuous practice.

3 lessons

Litmus, Chaos Mesh, Chaos Monkey for K8s · 30 min
Designing Game Days · 30 min
Chaos in CI/CD · 30 min

Interview-Ready Architecture Reasoning

The big interview questions answered with the reasoning frameworks the rest of the course built up.

3 lessons

Walk Me Through What Happens When You Create a Pod · 30 min
Design a Self-Healing K8s Cluster · 30 min
How Would You Test Resilience Without Breaking Production? · 30 min

About the Author

Sharon Sahadevan

AI Infrastructure Engineer

Builds production GPU clusters on Kubernetes: H100s, large-scale model serving, and end-to-end ML infrastructure across Azure and AWS.

10+ years designing cloud-native platforms with deep expertise in Kubernetes orchestration, GitOps (Argo CD), Terraform, and MLOps pipelines for LLM deployment.

Author of KubeNatives, a weekly newsletter read by 3,000+ DevOps and ML engineers for production insights on K8s internals, GPU scheduling, and model-serving patterns.
