Intermediate-Advanced · 12 hours · 24 lessons

Kubernetes Debugging for SREs

The systematic debugging playbook for Kubernetes in production, from the layered debugging mental model (App → Pod → Node → Cluster → Cloud) to the 3 AM incident playbook. Covers pod failures, node issues, networking problems, storage debugging, control plane diagnostics, and incident response, all built from real production incident experience.

Text-based, no videos

8 modules, 24 lessons

Lifetime access

What you'll learn

The layered debugging mental model: App, Pod, Node, Cluster, Cloud, and how to locate the failure layer in 5 minutes

Symptoms vs root causes: why most teams fix symptoms repeatedly instead of finding causes

The investigation toolkit every SRE needs: kubectl, crictl, journalctl, tcpdump, ephemeral containers, kubectl debug

Pod-level debugging: pod won't start, started but crashed, running but wrong, with a concrete diagnostic flow for each

Node-level debugging: NotReady transitions, slow nodes, node-specific pod issues

Networking debugging: pod-to-pod connectivity, DNS, ingress and load balancers, with real production scenarios

Storage debugging: PVC stuck Pending, mount failures, the data loss scenarios that require recovery

Control plane debugging: apiserver slowness, scheduler stuck, controller-manager issues

The 3 AM incident playbook: first 5 minutes, communication, mitigation patterns

Communicating during incidents: what to say, when to escalate, status page discipline

The post-incident review: blameless format, action items that get done, learning from incidents

Building debug-friendly systems: observability, runbooks, chaos engineering as prevention

Real production incidents walked through end-to-end with the diagnostic flow

Curriculum

8 modules · 24 lessons

The Debugging Mental Model

The layered approach (App, Pod, Node, Cluster, Cloud), the symptom-vs-root-cause discipline, and the investigation toolkit every SRE needs.

3 lessons

The Layered Debugging Approach · 30 min · FREE

Symptoms vs Root Causes · 30 min · FREE

The Investigation Toolkit · 30 min · FREE
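To make the layered model concrete, here is a minimal sketch of a first-pass walk through the five layers. The layer order is taken from the course description; the specific first command per layer, the function name `first_checks`, and the sample namespace and pod names are illustrative assumptions, not the course's own checklist.

```python
from typing import List, Optional, Tuple

# Layer order as given in the course description.
LAYERS = ["App", "Pod", "Node", "Cluster", "Cloud"]


def first_checks(namespace: str, pod: str) -> List[Tuple[str, Optional[str]]]:
    """Return one suggested first-look command per layer, in layer order.

    The commands are assumptions chosen for illustration. The Cloud layer
    has no kubectl command, so its entry is None.
    """
    commands = {
        "App": f"kubectl logs {pod} -n {namespace} --tail=100",
        "Pod": f"kubectl describe pod {pod} -n {namespace}",
        "Node": "kubectl get nodes -o wide",
        "Cluster": "kubectl get events -A --sort-by=.metadata.creationTimestamp",
        "Cloud": None,  # check the provider's console or status page instead
    }
    return [(layer, commands[layer]) for layer in LAYERS]


if __name__ == "__main__":
    # "prod" and "api-0" are hypothetical names.
    for layer, cmd in first_checks("prod", "api-0"):
        print(f"{layer:8} {cmd or '(check the cloud provider console)'}")
```

Keeping the layers in a fixed order is the point of the sketch: each check either clears a layer or identifies it as the one at fault before moving outward.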

Pod-Level Debugging

The three flavors of pod failure (won't start, started but crashed, running but wrong) and the diagnostic flow for each.

3 lessons

The Pod Won't Start · 30 min

The Pod Started But Crashed · 30 min

The Pod Is Running But Wrong · 30 min
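As a rough illustration of the three flavors, the sketch below sorts a misbehaving pod into one of them from its status, as returned by `kubectl get pod <name> -o json`. The heuristics and the function name `classify_pod` are assumptions made for this example; the course's actual diagnostic flows may differ.

```python
def classify_pod(pod: dict) -> str:
    """Classify a pod that is already known to be misbehaving.

    Expects the dict parsed from `kubectl get pod <name> -o json`.
    Heuristic only: a Running pod with no restarts is labeled
    "running-but-wrong" because status alone cannot prove it is healthy.
    """
    status = pod.get("status", {})
    containers = status.get("containerStatuses", [])

    for cs in containers:
        state = cs.get("state", {})
        waiting = state.get("waiting", {})
        terminated = state.get("terminated", {})
        crashed_before = "terminated" in cs.get("lastState", {})
        if (
            cs.get("restartCount", 0) > 0
            or crashed_before
            or waiting.get("reason") == "CrashLoopBackOff"
            or terminated.get("exitCode", 0) != 0
        ):
            return "started-but-crashed"

    # No sign of a crash: still Pending or still waiting means it never started.
    if status.get("phase") == "Pending" or any(
        "waiting" in cs.get("state", {}) for cs in containers
    ):
        return "wont-start"

    if status.get("phase") == "Running":
        return "running-but-wrong"

    return "unknown"
```

A typical use would be to load the JSON with `json.loads` and branch on the returned label to pick which diagnostic flow to follow.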

Node-Level Debugging

When the node is the layer at fault. NotReady transitions, slow nodes, and the per-node performance issues that look like application bugs.

3 lessons

The Node Is NotReady · 30 min

The Node Is Slow · 30 min

The Pod Is Slow on This Specific Node · 30 min

Networking Debugging

The most-feared debugging surface. Pod-to-pod connectivity, DNS, ingress, and load balancers, with a systematic walk through each layer.

3 lessons

Pods Can't Reach Each Other · 30 min

DNS Issues · 30 min

Ingress and Load Balancer Issues · 30 min

Storage Debugging

The PVC lifecycle and where it gets stuck. Pending PVCs, mount failures, and the data loss scenarios that require careful recovery.

3 lessons

PVC Stuck in Pending · 30 min

Pod Stuck Mounting Volume · 30 min

Data Loss Scenarios · 30 min

Control Plane Debugging

When the cluster's brain is the layer at fault. Slow apiserver, stuck scheduler, and broken controllers: the highest-blast-radius debugging surface.

3 lessons

The API Server Is Slow · 30 min

The Scheduler Won't Place Pods · 30 min

Controllers Stopped Working · 30 min

The 3 AM Incident Playbook

The structured response when production is on fire. First five minutes, communication during the incident, and the post-mortem that produces real learning.

3 lessons

The First 5 Minutes · 30 min

Communicating During Incidents · 30 min

The Post-Incident Review · 30 min

Building Debug-Friendly Systems

Designing systems that are easier to debug before the next incident. Observability that helps under pressure, runbooks that work, and chaos engineering as preventive maintenance.

3 lessons

Observability That Helps During Incidents · 30 min

Runbooks That Actually Work · 30 min

Chaos Engineering as Prevention · 30 min

About the Author

Sharon Sahadevan

AI Infrastructure Engineer

Building production GPU clusters on Kubernetes. H100s, large-scale model serving, and end-to-end ML infrastructure across Azure and AWS.

10+ years designing cloud-native platforms with deep expertise in Kubernetes orchestration, GitOps (Argo CD), Terraform, and MLOps pipelines for LLM deployment.

Author of KubeNatives, a weekly newsletter read by 3,000+ DevOps and ML engineers for production insights on K8s internals, GPU scheduling, and model-serving patterns.
