Kubernetes Debugging for SREs
A systematic debugging playbook for Kubernetes in production, from the layered debugging mental model (App → Pod → Node → Cluster → Cloud) to the 3 AM incident playbook. It covers pod failures, node issues, networking problems, storage debugging, control plane diagnostics, and incident response, all built from real production incident experience.
Curriculum
8 modules · 24 lessons
The Debugging Mental Model
The layered approach (App, Pod, Node, Cluster, Cloud), the symptom-vs-root-cause discipline, and the investigation toolkit every SRE needs.
Pod-Level Debugging
The three flavors of pod failure (won't start, started but crashed, running but wrong) and the diagnostic flow for each.
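As a rough illustration of how the three flavors might be told apart, here is a minimal sketch that classifies a pod from the JSON that `kubectl get pod <name> -o json` returns. The function name `classify_pod` and the exact decision order are assumptions made for this example, not the course's diagnostic flow; only the status fields it reads (`phase`, `containerStatuses`, `state`, `lastState`, `restartCount`, `ready`) come from the Pod API.

```python
def classify_pod(pod: dict) -> str:
    """Sort a pod into one of three failure flavors based on its status.

    `pod` is a dict shaped like the output of `kubectl get pod -o json`.
    The classification rules are a simplification chosen for illustration.
    """
    status = pod.get("status", {})
    phase = status.get("phase", "Unknown")
    containers = status.get("containerStatuses", [])

    # Started but crashed: a container has restarted, has a previous
    # terminated state, or is currently terminated with a non-zero exit code.
    for c in containers:
        crashed_before = (
            "terminated" in c.get("lastState", {}) or c.get("restartCount", 0) > 0
        )
        terminated = c.get("state", {}).get("terminated")
        failed_now = terminated is not None and terminated.get("exitCode", 0) != 0
        if crashed_before or failed_now:
            return "started but crashed"

    # Won't start: the pod is still Pending, or a container is waiting
    # (for example on an image pull) without ever having run.
    if phase == "Pending" or any("waiting" in c.get("state", {}) for c in containers):
        return "won't start"

    # Running but wrong: the pod runs, yet a container fails readiness.
    if phase == "Running" and any(not c.get("ready", False) for c in containers):
        return "running but wrong"

    return "no failure visible in pod status"


if __name__ == "__main__":
    crash_looping = {
        "status": {
            "phase": "Running",
            "containerStatuses": [
                {
                    "restartCount": 4,
                    "ready": False,
                    "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                    "lastState": {"terminated": {"exitCode": 1}},
                }
            ],
        }
    }
    print(classify_pod(crash_looping))
```

A pod that is "running but wrong" often looks healthy in its status, so a result of "no failure visible in pod status" does not rule that flavor out; logs and application behavior still need checking.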
Node-Level Debugging
When the node is the layer at fault. NotReady transitions, slow nodes, and the per-node performance issues that look like application bugs.
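To make the node-level check concrete, the sketch below reads the `status.conditions` list from `kubectl get node <name> -o json` and reports anything that points at the node itself. The helper name `node_problems` and the output format are assumptions for this example; the condition types (`Ready`, `MemoryPressure`, `DiskPressure`, `PIDPressure`, `NetworkUnavailable`) are the standard node conditions.

```python
# Conditions where status "True" signals a problem (unlike Ready).
PRESSURE_TYPES = ("MemoryPressure", "DiskPressure", "PIDPressure", "NetworkUnavailable")


def node_problems(node: dict) -> list[str]:
    """Return human-readable problems found in a node's conditions.

    `node` is a dict shaped like the output of `kubectl get node -o json`.
    An empty list means the conditions show nothing wrong.
    """
    conditions = node.get("status", {}).get("conditions", [])
    by_type = {c.get("type"): c for c in conditions}
    problems = []

    # Ready must be "True"; "False" and "Unknown" both mean NotReady.
    ready = by_type.get("Ready")
    if ready is None:
        problems.append("Ready condition missing")
    elif ready.get("status") != "True":
        problems.append(
            f"NotReady (status={ready.get('status')}, reason={ready.get('reason', 'n/a')})"
        )

    # Pressure conditions are healthy when "False".
    for cond_type in PRESSURE_TYPES:
        cond = by_type.get(cond_type)
        if cond is not None and cond.get("status") == "True":
            problems.append(cond_type)

    return problems


if __name__ == "__main__":
    node = {
        "status": {
            "conditions": [
                {"type": "Ready", "status": "True"},
                {"type": "DiskPressure", "status": "True"},
            ]
        }
    }
    print(node_problems(node))
```

Conditions only cover what the kubelet reports. A slow node with healthy conditions would not show up here and needs node-level metrics instead.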
Networking Debugging
The most-feared debugging surface. Pod-to-pod connectivity, DNS, ingress and load balancers — the systematic walk through each layer.
Storage Debugging
The PVC lifecycle and where it gets stuck. Pending PVCs, mount failures, and the data loss scenarios that require careful recovery.
Control Plane Debugging
When the cluster's brain is the layer at fault. Slow apiserver, stuck scheduler, broken controllers — the highest-blast-radius debugging surface.
The 3 AM Incident Playbook
The structured response when production is on fire. First five minutes, communication during the incident, and the post-mortem that produces real learning.
Building Debug-Friendly Systems
Designing systems that are easier to debug before the next incident. Observability that helps under pressure, runbooks that work, and chaos engineering as preventive maintenance.
About the Author

Sharon Sahadevan
AI Infrastructure Engineer
Building production GPU clusters on Kubernetes — H100s, large-scale model serving, and end-to-end ML infrastructure across Azure and AWS.
10+ years designing cloud-native platforms with deep expertise in Kubernetes orchestration, GitOps (Argo CD), Terraform, and MLOps pipelines for LLM deployment.
Author of KubeNatives, a weekly newsletter read by 3,000+ DevOps and ML engineers for production insights on K8s internals, GPU scheduling, and model-serving patterns.