Intermediate-Advanced · 12 hours · 24 lessons

Kubernetes Debugging for SREs

A systematic debugging playbook for Kubernetes in production, from the layered debugging mental model (App → Pod → Node → Cluster → Cloud) to the 3 AM incident playbook. It covers pod failures, node issues, networking problems, storage debugging, control plane diagnostics, and incident response, all built from real production incident experience.

Early Bird Pricing
$59 (regular price $79). Save $20.

One-time payment. Lifetime access.

Text-based, no videos
8 modules, 24 lessons
Lifetime access

What you'll learn

The layered debugging mental model — App, Pod, Node, Cluster, Cloud — and how to locate the failure layer in 5 minutes
Symptoms vs root causes: why most teams fix symptoms repeatedly instead of finding causes
The investigation toolkit every SRE needs: kubectl, crictl, journalctl, tcpdump, ephemeral containers, kubectl debug
Pod-level debugging: pod won't start, started-but-crashed, running-but-wrong — concrete diagnostic flows for each
Node-level debugging: NotReady transitions, slow nodes, node-specific pod issues
Networking debugging: pod-to-pod connectivity, DNS, ingress and load balancers — with real production scenarios
Storage debugging: PVC stuck Pending, mount failures, the data loss scenarios that require recovery
Control plane debugging: apiserver slowness, scheduler stuck, controller-manager issues
The 3 AM incident playbook: first 5 minutes, communication, mitigation patterns
Communicating during incidents — what to say, when to escalate, status page discipline
The post-incident review: blameless format, action items that get done, learning from incidents
Building debug-friendly systems: observability, runbooks, chaos engineering as prevention
Real production incidents walked through end-to-end with the diagnostic flow

Curriculum

8 modules · 24 lessons

01. The Debugging Mental Model

The layered approach (App, Pod, Node, Cluster, Cloud), the symptom-vs-root-cause discipline, and the investigation toolkit every SRE needs.

3 lessons

02. Pod-Level Debugging

The three flavors of pod failure (won't start, started but crashed, running but wrong) and the diagnostic flow for each.

3 lessons

03. Node-Level Debugging

When the node is the layer at fault. NotReady transitions, slow nodes, and the per-node performance issues that look like application bugs.

3 lessons

04. Networking Debugging

The most-feared debugging surface. Pod-to-pod connectivity, DNS, ingress, and load balancers, with a systematic walk through each layer.

3 lessons

05. Storage Debugging

The PVC lifecycle and where it gets stuck. Pending PVCs, mount failures, and the data loss scenarios that require careful recovery.

3 lessons

06. Control Plane Debugging

When the cluster's brain is the layer at fault. Slow apiserver, stuck scheduler, broken controllers — the highest-blast-radius debugging surface.

3 lessons

07. The 3 AM Incident Playbook

The structured response when production is on fire. First five minutes, communication during the incident, and the post-mortem that produces real learning.

3 lessons

08. Building Debug-Friendly Systems

Designing systems that are easier to debug before the next incident. Observability that helps under pressure, runbooks that work, and chaos engineering as preventive maintenance.

3 lessons

About the Author

Sharon Sahadevan

AI Infrastructure Engineer

Building production GPU clusters on Kubernetes — H100s, large-scale model serving, and end-to-end ML infrastructure across Azure and AWS.

10+ years designing cloud-native platforms with deep expertise in Kubernetes orchestration, GitOps (Argo CD), Terraform, and MLOps pipelines for LLM deployment.

Author of KubeNatives, a weekly newsletter read by 3,000+ DevOps and ML engineers for production insights on K8s internals, GPU scheduling, and model-serving patterns.

Ready to master this topic?

Start with the free preview lesson and see for yourself.