Kubernetes Debugging for SREs

The Layered Debugging Approach

The pager fires at 3 AM. The dashboard says "checkout API error rate at 8 percent." The on-call engineer has thirty seconds before the adrenaline takes over. What they do next determines whether the incident takes 15 minutes or four hours.

The teams who recover fast have one thing in common: they do not jump straight to a hypothesis. They use a layered approach to find the failure layer first, then debug within that layer. The teams that take four hours are usually the ones that started with "must be the database" and spent ninety minutes confirming the database was fine.

This lesson is the framework: five layers, the order to check them, and the cheap diagnostics at each level that orient you fast.

KEY CONCEPT

The fastest debug is the one that locates the failure layer before investigating any specific cause. In most Kubernetes production incidents, the cause is not at the layer where the symptom appears: error rates spike in the application, but the cause is in the data plane (kubelet, CNI, kube-proxy) or below. The layered approach finds the layer in about 5 minutes; jumping straight to the application layer often wastes the first 30.

The five layers

Every Kubernetes incident sits in one of these five layers:

[Diagram: The five layers of a Kubernetes incident. Application: bug, dependency, config. Pod: OOMKilled, CrashLoop, Evicted, ImagePullBackOff. Node: disk full, kernel panic, kubelet down, network drop. Cluster: apiserver throttle, CoreDNS lag, etcd issue, scheduler stuck. Cloud: AZ outage, IAM, quota, region-wide. Symptoms cascade up; causes work top-down or bottom-up.]

The layers from smallest to largest blast radius:

  • Application: a bug, a misconfigured dependency, a corrupted state inside one pod's process. Affects one workload.
  • Pod: the pod is failing at the Kubernetes lifecycle level (OOMKilled, CrashLoopBackOff, Evicted, ImagePullBackOff). Affects the pods of one workload.
  • Node: the node is unhealthy: kubelet down, disk full, kernel issue, network problem. Affects everything on that node.
  • Cluster: a control-plane component is failing: apiserver overloaded, CoreDNS lag, etcd disk pressure. Affects everything in the cluster.
  • Cloud: the cloud's underlying services are degraded: an AZ outage, IAM throttling, quota exhaustion. Affects the entire cluster, and often more than one cluster.

A symptom that looks application-level often has a cause one or more layers down. Error rate spikes on the checkout API; the cause is that CoreDNS is slow because the apiserver is throttled, because etcd's disk is slow, because the cloud has degraded EBS performance in this AZ.

The layered approach catches this. Without it, you start at the application and spend an hour proving it is fine before looking elsewhere.

The triage in two minutes

When the alert fires, the first two minutes orient you. Three checks:

Check 1: scope of impact

Look at the SLO dashboard. Is the impact on every service or just one?

  • Every service: cluster or cloud layer. CoreDNS, apiserver, etcd, or cloud-side. Skip the application-level investigation.
  • One service: app, pod, or node layer. Drill into that service.

This single observation rules out 60% of the search space in 30 seconds.
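
If the dashboard itself is slow or unavailable, a rough cluster-wide proxy for the same question is a field-selector query for unhealthy pods (a sketch, not a replacement for SLO data):

# Unhealthy pods across all namespaces: spread everywhere, or concentrated in one?
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded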

Check 2: cluster events

kubectl get events -A --sort-by=.lastTimestamp | tail -30

Recent events tell you what changed. Common findings:

  • Node NotReady transitions: drop straight to node-level debugging.
  • Failed pods, eviction events: pod or node layer.
  • Failed scheduling: scheduler or capacity issue.
  • Backoff loops: pod or app layer.
  • Nothing unusual: app layer or sub-cluster issue not generating events.

The events feed is one of the most under-used diagnostic surfaces. Use it.
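
When the full feed is noisy, narrowing it to warnings keeps the signal:

# Warnings only, newest last
kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp | tail -30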

Check 3: what changed

The most predictive question in any incident: what changed?

  • Recent deploys (last 1-4 hours)?
  • Cluster upgrades, control-plane changes?
  • Configuration pushes, feature flag flips?
  • Cloud-side maintenance windows?
  • Traffic patterns out of normal range?

If the symptom started right after a deploy, the deploy is the prime suspect. If a kubelet upgrade went out yesterday and nodes are flapping NotReady today, that is your lead. Recent change is the hypothesis with the highest prior probability, and when it explains the symptom it usually rules out most other causes.
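
If there is no deploy dashboard to consult, the cluster itself records rollouts: every Deployment rollout creates a new ReplicaSet, so sorting ReplicaSets by creation time is a rough sketch of recent deploys:

# ReplicaSets created most recently, newest last (each rollout creates one)
kubectl get replicasets -A --sort-by=.metadata.creationTimestamp | tail -20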

These three checks take 90-120 seconds total. By the time you are two minutes in, you have a layer hypothesis. That hypothesis tells you what to investigate next.

Walking the layers

Once you have a layer hypothesis, you investigate within it. Each layer has its own quick diagnostic.

Application layer

Symptoms: errors come from the application's code or its calls to specific dependencies. The pod is healthy at the Kubernetes level: Running, Ready, no recent restarts. The service was deployed recently or its dependencies have changed.

Diagnostic flow:

# Tail logs across replicas
kubectl logs -n prod-checkout -l app=checkout-api --tail=200 --since=10m

# Look at recent commits / deploys
kubectl rollout history deployment/checkout-api -n prod-checkout

# If the issue started with a recent deploy, mitigate first
kubectl rollout undo deployment/checkout-api -n prod-checkout

The application layer is where the application's own logs matter most. If the deploy is the suspect, mitigate first (revert) and root-cause after.
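
After the revert, watch it complete before declaring the mitigation done; a sketch using the same hypothetical names as above:

# Block until the rollback has rolled out (or the timeout expires)
kubectl rollout status deployment/checkout-api -n prod-checkout --timeout=120s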

Pod layer

Symptoms: pods are failing at the Kubernetes lifecycle: Pending forever, CrashLoopBackOff, OOMKilled, ImagePullBackOff, Evicted.

Diagnostic flow:

# What state are pods in?
kubectl get pods -n prod-checkout -o wide

# Why is a specific pod failing?
kubectl describe pod -n prod-checkout checkout-api-abc123

# What did the previous container exit with?
kubectl logs -n prod-checkout checkout-api-abc123 --previous

# Recent events for this pod
kubectl get events -n prod-checkout \
  --field-selector involvedObject.name=checkout-api-abc123

kubectl describe pod is where most pod-level investigation starts. The Events section at the bottom narrates the pod's lifecycle. If the pod was OOMKilled, the previous container's logs often show the issue.
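
One quick way to confirm an OOMKill without scanning the full describe output is to read the previous container's termination reason directly (a sketch using the same hypothetical pod name):

# Prints "OOMKilled" if the last container termination was an OOM kill
kubectl get pod -n prod-checkout checkout-api-abc123 \
  -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'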

Module 2 covers each pod-level failure mode in depth.

Node layer

Symptoms: pods on a specific node are unhealthy while pods elsewhere are fine. Or the Node itself is NotReady. Or DaemonSets on that node are reporting issues.

Diagnostic flow:

# Cluster-wide node status
kubectl get nodes -o wide

# Why is a node NotReady?
kubectl describe node ip-10-0-3-21

# Pod distribution on the suspect node
kubectl get pods -A --field-selector spec.nodeName=ip-10-0-3-21

# Node-level diagnostics (if you have access)
kubectl debug node/ip-10-0-3-21 -it --image=nicolaka/netshoot

kubectl describe node shows the Conditions (Ready, MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable). The Events section shows recent state changes.
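
To scan conditions across every node at once rather than describing them one by one, a jsonpath query works as a rough sketch:

# One line per node, listing each condition as type=status
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .status.conditions[*]}{.type}={.status} {end}{"\n"}{end}'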

Module 3 covers node-level debugging in depth.

Cluster layer

Symptoms: every service is slow or returning errors. kubectl is sluggish. New pods are slow to schedule. Multiple Deployments stalled simultaneously.

Diagnostic flow:

# Apiserver health from outside (use kubectl from another machine if possible)
kubectl get --raw /healthz
kubectl get --raw /metrics | grep apiserver_request_duration

# CoreDNS health
kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system top pods -l k8s-app=kube-dns

# Etcd health (if accessible)
kubectl -n kube-system get pods -l component=etcd

# Pending pods (scheduler health)
kubectl get pods -A --field-selector status.phase=Pending

The cluster layer is the one most teams under-investigate. When every service is slow, suspect CoreDNS first (Module 4.2 covers this in depth), the apiserver second, and etcd third. Slow DNS surfacing as application-level latency is one of the most commonly misdiagnosed cluster-layer issues.
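
A quick spot check for slow DNS from inside the cluster: run a one-off pod and time a lookup (the pod name and image here are illustrative):

# Time a single in-cluster DNS lookup from a throwaway pod
kubectl run dns-probe --rm -it --restart=Never --image=busybox:1.36 -- \
  sh -c 'time nslookup kubernetes.default.svc.cluster.local'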

Module 6 covers control plane debugging in depth.

Cloud layer

Symptoms: cluster-wide impact AND something cloud-specific is wrong. AZ unreachable. IAM-related errors. EBS attaching slowly. Quota exhausted. The cloud's status page shows degraded service.

Diagnostic flow:

  • Cloud's status page (status.aws.amazon.com, status.cloud.google.com, status.azure.com).
  • Region/AZ checks: are nodes in one AZ NotReady?
  • IAM / cred checks: are pods getting AccessDenied errors that they did not yesterday?
  • Quota checks: did a scale-up fail because of a quota limit?

If the cloud's status page says degraded service in your region, your incident is likely downstream of that. Mitigation: fail over to another region if you have multi-region; otherwise wait, communicate, and watch for the cloud to recover.
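
For the AZ check, the standard topology label makes zone concentration visible in one command (assuming your nodes carry the well-known zone label):

# Show each node's zone; look for NotReady nodes clustered in one zone
kubectl get nodes -L topology.kubernetes.io/zone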

When to escalate vs investigate

A specific judgment call: at what point do you escalate (page another engineer, alert leadership, declare incident) instead of continuing to investigate?

The thresholds:

  • 5 minutes in, no progress on layer identification: escalate or grab a second pair of eyes. Working alone with no signal eats time.
  • 15 minutes in, found the layer but cannot fix: escalate to whoever owns that layer (platform team for cluster issues, app team for app issues).
  • Customer-impacting, not yet mitigated: declare incident, communicate externally (status page).
  • Cloud-layer issue: status-page communication; coordinate with the cloud's support if you have premium support.

Escalation is not failure. It is the right call when your time-per-decision is too long for the impact level.

A worked example

A representative incident. The pager: "checkout API error rate at 8 percent for 3 minutes."

Minute 0-1: orient

  • SLO dashboard: only checkout-api is impacted; other services are fine.
  • That means: app, pod, or node layer (not cluster-wide).

Minute 1-2: cluster events

kubectl get events -A --sort-by=.lastTimestamp | tail -30

Output shows:

1m    Warning   FailedScheduling    pod/checkout-api-xyz   ...
1m    Warning   FailedScheduling    pod/checkout-api-yyy   ...
3m    Normal    NodeNotReady        node/ip-10-0-3-21      ...

Two pending pods + a NodeNotReady event 3 minutes ago. The layer hypothesis: node-level. Specifically, node ip-10-0-3-21 went down; pods on it cannot reschedule because of capacity or constraints.

Minute 2-3: confirm and mitigate

# Confirm the node is NotReady
kubectl get nodes ip-10-0-3-21

# What pods were running on that node?
kubectl get pods -A --field-selector spec.nodeName=ip-10-0-3-21

# Why pending pods can't schedule
kubectl describe pod -n prod-checkout checkout-api-xyz | grep -A 10 Events

Pending pods show: 0/12 nodes are available: 11 Insufficient cpu, 1 node(s) had untolerated taint. The remaining 11 nodes are at capacity; the bad node accounts for the 12th.

Minute 3-5: mitigate

The mitigation choices:

  • Scale up the cluster (if cluster autoscaler is on, this should be in flight).
  • Cordon and drain the bad node (force its pods to reschedule with priority).
  • Reduce other workload pressure (terminate non-critical pods to free capacity).

Most likely the cluster autoscaler is provisioning a new node; you wait 60-180 seconds for it. Pods reschedule; error rate drops.
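
If the autoscaler is not responding and you choose to cordon and drain, a sketch with this example's node name:

# Stop new pods landing on the bad node, then evict what is still there
kubectl cordon ip-10-0-3-21
kubectl drain ip-10-0-3-21 --ignore-daemonsets --delete-emptydir-data --timeout=120s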

Minute 5-30: root-cause

After mitigation, why did the node go NotReady? Module 3 walks through this. Could be:

  • Spot interruption.
  • Hardware failure.
  • Kubelet OOM.
  • Network partition.

Each has a specific signature; check kubelet logs, instance metadata, Kubernetes events around the transition time.
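
A starting point, using this example's node name (the kubelet logs themselves live on the node, reachable via SSH or a node debug pod):

# Events recorded against the node around the transition time
kubectl get events -A --sort-by=.lastTimestamp \
  --field-selector involvedObject.kind=Node,involvedObject.name=ip-10-0-3-21

# On the node itself: journalctl -u kubelet --since "1 hour ago"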

What this would have looked like without the layered approach

Without the framework, the on-call's first reaction might be: "checkout API is failing, must be a checkout API bug, let me look at the code." Twenty minutes go into reading recent commits. No bug found. Then "must be the database, let me check." Twenty more minutes. Still no progress.

The layered approach finds the right layer in 90 seconds. The rest is normal investigation.

Anti-patterns

The patterns that waste time:

Hypothesis tunnel vision

"It must be the cache." Spent 30 minutes confirming the cache was fine. Was actually a NetworkPolicy that someone deployed 4 hours ago. The first impression is rarely the right one; verify against multiple signals.

Going straight to logs

kubectl logs -f on the application's pod, scrolling, looking for clues. Hard to spot the pattern. Without context (what changed? what events?), logs are noise.

Changing things while investigating

"Let me restart the deployment." Changes during investigation introduce new variables; you no longer know whether the symptom is the original incident or a side-effect of the change.

The exception: clear mitigation (revert a recent deploy). That is intentional and reduces user impact, not exploratory.

Skipping the cluster events feed

kubectl get events --sort-by=.lastTimestamp is the single most useful diagnostic command in Kubernetes. Skipping it means missing the obvious causes. Do this every time.

Working alone too long

The on-call who solo-debugs for 45 minutes before escalating costs everyone. Pair-debugging is faster: two engineers usually find the issue far sooner than one, even when only one is running commands, because the second notices what the first misses.

WAR STORY

A team's pager fires for "checkout API errors." The on-call engineer assumes an app bug and spends 40 minutes looking at recent code changes. Eventually a senior engineer joins and runs kubectl get events --sort-by=.lastTimestamp. Top result: a CoreDNS pod was OOMKilled 38 minutes ago and never came back; the remaining replica was overloaded. Every checkout API request was waiting on slow DNS. Fix: scale up CoreDNS and delete the dead pod so a replacement schedules. Total fix time after running the events command: 3 minutes. The 40-minute prior investigation was on the wrong layer entirely. Lesson: events first, hypothesis second.

Building the muscle

The layered approach gets faster with practice. Two ways to build the muscle:

Run the framework on every incident

Every alert: orient (scope), check events, identify recent change. Even when the symptom seems obvious, force the framework. The reps build the habit.

Read post-mortems

Your own and others'. Public ones from companies like Cloudflare, AWS, Stripe. Each one has a layer the cause sat in; trace the symptom-to-cause path. Build a mental library of "looked-like-X-was-actually-Y" patterns.

After a year of consistent practice, the framework is automatic. The first 90 seconds of every incident produces a layer hypothesis without conscious effort. From there the specific debugging within the layer takes over, covered in the next modules.

Summary

The layered debugging framework: five layers (App, Pod, Node, Cluster, Cloud), checked in priority order, with cheap diagnostics at each.

The two-minute triage:

  • Scope of impact: every service or one?
  • Cluster events feed: what changed in the last few minutes?
  • What changed: deploy, config, upgrade, infra event?

The principles:

  • Locate the layer first; investigate within the layer second.
  • Recent change is the hypothesis with the highest prior probability.
  • Cluster events feed is the single most useful diagnostic command.
  • Mitigate user impact before root-causing.
  • Escalate when time-per-decision is too long for the impact level.

The next lesson goes one level deeper: the discipline of finding root causes vs. fixing symptoms, and why the difference matters for whether you debug the same incident twice.