Symptoms vs Root Causes
The same incident, three months apart. A team gets paged because the checkout API is returning 5xx errors. They restart the deployment; the errors stop. The post-incident note reads "resolved." Three months later the same alert fires; same fix; same resolution. The team is debugging the same incident repeatedly because they fixed the symptom (5xx errors) without finding the root cause (a memory leak that triggers an OOMKill every few months).
This lesson is about that discipline. Why teams fix symptoms instead of root causes, what root cause analysis actually looks like in Kubernetes, and the patterns that prevent the same incident from happening again.
Most production teams I have audited can name three to five recurring incidents that "just keep happening." Every one is a symptom-fix masquerading as a resolution. The root cause has not been found because the team mitigated the user impact (correctly) and then stopped (incorrectly). Mitigation is the first move; root cause analysis is what prevents the next page.
What "symptom" and "root cause" mean
A definition that helps:
- Symptom: the user-visible or alert-visible effect of the problem. "Error rate is 5 percent." "Pod is in CrashLoopBackOff." "p99 latency spiked."
- Mitigation: an action that makes the symptom stop. Restart the deployment, scale up replicas, fail over to another region.
- Root cause: the underlying condition that caused the symptom. Memory leak in the checkout service. Conntrack table full. A cron job creating a noisy resource at midnight.
A complete incident response has all three: identify the symptom, mitigate the user impact, find the root cause and fix it. Skipping the third step means the incident recurs.
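The mitigation step is usually a one-liner. For the checkout example, either of these would stop the bleeding (deployment and namespace names here are placeholders):
# Restart the deployment: fresh pods replace the broken ones
kubectl rollout restart deployment/checkout -n prod-checkout
# Or add capacity while you investigate
kubectl scale deployment/checkout -n prod-checkout --replicas=6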
Why teams skip root cause analysis
The honest reasons, none of them malicious:
Mitigation worked, so the team relaxes
The pressure was high while users were impacted. Once the mitigation works, the pressure drops. Engineers context-switch; the incident moves to "follow-up." The follow-up never happens because there is always something more urgent.
"We don't know exactly what happened, but it's fixed"
The mitigation worked but the team is not sure why. Restarting fixed it; what was wrong before the restart? Without follow-up investigation, the team learns nothing. The next time, same blind spot.
The fix is "we'll be more careful"
Post-mortem action: "engineers should review configuration more carefully before deploying." This is not a fix. Humans make the same mistakes systematically; "be more careful" is a wish, not an action.
The cost of investigating exceeds perceived value
Two-hour incident, mitigated with a restart. Investigating the root cause might take a day. The team rationalizes: "it was a one-off." It is not. It will recur, and the cumulative cost over a year is more than the day of investigation.
No one owns the post-mortem
The on-call engineer mitigated. The application team owns the code. The platform team owns the cluster. Nobody is the obvious owner of "find the root cause." It falls between teams; nobody picks it up.
The discipline of root cause analysis exists to fight all of these.
The 5 whys, in Kubernetes form
A useful technique borrowed from Toyota, adapted for software incidents. After mitigation, ask "why" five times.
A worked example:
Symptom: checkout API returned 5xx errors for 12 minutes.
- Why? Pods were OOMKilled and crashlooping.
- Why? Memory usage spiked to 2 GB; limit was 1 GB.
- Why? A specific request type was holding 500 MB in memory; concurrent requests stacked.
- Why? The request-handler had no cache eviction; every cached object stayed forever.
- Why? The cache was added in PR #1234 without eviction logic; nobody caught it in review.
The fix is not "increase the memory limit" (the first level — symptom-fix). The fix is "add eviction to the cache" (the fifth level — root cause). Plus a process-level fix: PR review checklist for cache changes.
A symptom-fix would be: bump the memory limit to 2 GB. Works; recurs in three months when usage grows further.
A root-cause fix is: bounded cache + process improvement. The same memory pattern cannot break the system again.
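The first two levels of that chain are verifiable straight from the cluster. A sketch, with the pod name and namespace as placeholders (kubectl top requires metrics-server):
# Confirm the OOMKill: the container's last state shows the reason and exit code 137
kubectl describe pod checkout-7d9f8-abcde -n prod-checkout | grep -A 8 "Last State"
# Compare actual usage against the configured limit
kubectl top pod -n prod-checkout
kubectl get pod checkout-7d9f8-abcde -n prod-checkout \
  -o jsonpath='{.spec.containers[0].resources.limits.memory}'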
Common Kubernetes symptom-fixes
Patterns I see in production teams that look like fixes but are not:
"We restarted the pod"
If a restart fixed it, why was the pod broken in the first place? Memory leak? Connection pool exhaustion? File descriptor leak? Stuck thread?
Restart is mitigation. The root cause is the reason the restart was needed; without finding it, you are just running out the clock until the next restart.
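If the pod object still exists (a container restarted in place rather than the pod being deleted), the pre-restart evidence is usually still there. A sketch, pod name illustrative:
# Why the previous container instance died (OOMKilled, Error, etc.)
kubectl get pod checkout-7d9f8-abcde -n prod-checkout \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'
# Logs from before the restart -- often the only record of what went wrong
kubectl logs checkout-7d9f8-abcde -n prod-checkout --previous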
"We scaled up replicas"
If scaling up fixed the symptom, the workload was capacity-bound. Why did capacity become inadequate? Traffic growth? An inefficient code path? A memory leak that effectively halves capacity?
Scaling up is fine as a mitigation, but the question "why did we need to scale up?" has a real answer.
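Two quick checks answer most of it: were the replicas actually saturated, and did the autoscaler see it coming? Assumes metrics-server and an HPA named checkout:
# Per-pod CPU and memory right now
kubectl top pods -n prod-checkout
# What the autoscaler saw: current vs target utilization, recent scaling events
kubectl describe hpa checkout -n prod-checkout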
"We bumped the resource limits"
The pod was OOMKilled at 1 GB; we bumped to 2 GB. The new limit holds. Did we just kick the can down the road, or did we genuinely need 2 GB?
Often this is symptom-fix masquerading as right-sizing. The leak that pushed past 1 GB will eventually push past 2 GB. The fix is to find the leak.
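A crude but effective test: sample memory over time under steady traffic. Growth without bound suggests a leak; a plateau suggests the working set genuinely grew. A sketch, assuming metrics-server and an illustrative pod name:
# Sample memory usage every 5 minutes and keep a log
while true; do
  echo "$(date --iso-8601=seconds) $(kubectl top pod checkout-7d9f8-abcde -n prod-checkout --no-headers)"
  sleep 300
done | tee memory-samples.log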
"We rolled back the deploy"
Rolling back stopped the errors. What in the deploy caused the errors? If you do not find that, you cannot deploy the change again — and you cannot tell whether the next change has the same problem.
Rollback is correct as immediate mitigation. Root cause analysis is finding what specifically changed.
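Kubernetes keeps the revision history, so "what specifically changed" is usually answerable without archaeology (revision numbers are illustrative):
# List deployment revisions (and change-cause annotations, if set)
kubectl rollout history deployment/checkout -n prod-checkout
# Diff the pod template of the bad revision against the previous good one
kubectl rollout history deployment/checkout -n prod-checkout --revision=42 > rev-42.yaml
kubectl rollout history deployment/checkout -n prod-checkout --revision=41 > rev-41.yaml
diff rev-41.yaml rev-42.yaml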
"It was a one-off"
Sometimes things really are one-offs. A specific bad packet, a brief external event, an action taken manually that won't be repeated. But "one-off" is a hypothesis, not a fact. Verify it against the data:
- Is this metric pattern unique to this incident, or does it appear historically?
- Could the conditions recur (cron schedule, traffic patterns, deploy windows)?
- Is the proposed fix robust to recurrence?
If the answers cannot rule out recurrence, the incident is not a one-off; it is a known unknown waiting to happen again.
The diagnostic discipline
A practical structure for root cause analysis after mitigation:
1. Reconstruct the timeline
What was the state of the system before the symptom appeared? At what specific moment did the symptom start? What changed in the minutes before?
# Events leading up to the incident (sorted oldest to newest; read the tail)
kubectl get events -A --sort-by=.lastTimestamp | tail -n 50
# Restart counts and current state of the affected pods
kubectl get pods -n prod-checkout -o wide
The timeline is the spine of the post-mortem.
2. Identify the trigger
What specific condition started the incident? It is almost always identifiable:
- A traffic burst.
- A specific deploy.
- A scheduled job firing.
- A cron-like external event.
- A node going NotReady.
- A configuration push.
The trigger is "what entered the system that broke it." Without identifying the trigger, you do not have a complete causal model.
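Most of those triggers leave a trace you can query. A few examples, with names and namespaces as placeholders:
# Did a deploy land just before the symptom started? New ReplicaSets mark rollouts
kubectl get replicasets -n prod-checkout --sort-by=.metadata.creationTimestamp
# Did a Job or CronJob fire around that time?
kubectl get jobs -A --sort-by=.status.startTime
# Did a node change state?
kubectl get events -A --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp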
3. Map cause to effect
From the trigger forward: how did the trigger turn into the symptom? Each step in the chain is an opportunity for prevention.
For the OOMKill example:
Trigger: traffic spike at 14:23
-> 50 concurrent requests of "type X" (15 normally)
-> each cached 500 MB
-> total memory exceeded limit
-> OOMKilled
-> CrashLoopBackOff
-> 5xx errors (symptom)
Each arrow is a "why." Breaking any link prevents the symptom. The cheapest break is at the cache layer (add eviction); the next-cheapest is at the limit layer (right-size); the most expensive is at the trigger layer (rate-limit traffic).
4. Categorize the cause
The five categories I use:
- Bug: code is wrong. Fix the code.
- Configuration: a Kubernetes setting, env var, flag is wrong. Fix the config.
- Capacity: not enough resources. Scale up; better autoscaling.
- Process: the change that caused this should not have been made; review process needs improvement.
- Architecture: the system is structurally fragile; needs design change.
The category matters because the fix is different for each. A configuration fix is a small PR; an architectural fix is a quarter of work.
5. Plan and execute the fix
Specific, scoped, owned, time-bound:
- Action: add cache eviction with a 100-item LRU.
- Owner: a named engineer on the payments team (a specific person, not a group).
- Deadline: end of next sprint.
- Verification: load test that 100 concurrent requests stay under 1 GB.
Action items without owners or deadlines are wishes. Hold the discipline.
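The verification item deserves to be executable rather than aspirational. A sketch using hey as the load generator (any HTTP load tool works; the URL, payload file, and namespace are placeholders):
# Drive 100 concurrent requests of the problematic type for 5 minutes
hey -z 5m -c 100 -m POST -D request-type-x.json https://staging.example.com/checkout
# In another terminal: confirm the pods stay under the 1 GB limit throughout
watch -n 10 kubectl top pods -n staging-checkout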
A common pattern: layer mismatch
A specific subset of symptom-fixes: fixing the wrong layer.
The symptom appears in the application layer (5xx errors). The team fixes the application (rolls back, adjusts a config). The actual cause is a node-level issue (PLEG slow, kubelet crashlooping). The application fix happens to coincide with the node recovering; the team thinks they fixed it.
The next time the node misbehaves, the symptom returns. The team is frustrated: "we fixed this last time."
The fix: trace causes back to the layer where they originated. The layered framework from lesson 1.1 helps; the cluster events feed shows whether the issue was actually in the application or somewhere lower.
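A thirty-second check that the cause is not a layer lower than the application (node name is a placeholder):
# Are all nodes Ready? A NotReady node under your pods is a lower-layer cause
kubectl get nodes
# Conditions on the node hosting the affected pods; PLEG problems surface in the Ready condition's message
kubectl describe node worker-12 | grep -A 10 "Conditions:"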
When to stop digging
Root cause analysis can go too deep. At some point, the cause is so far removed from the immediate problem that fixing it is impractical.
A useful stopping criterion: stop when the cause is something you can change, or something you can build guardrails against.
- "The traffic spike was caused by a marketing campaign" — you cannot stop marketing campaigns. The fix is autoscaling that handles them.
- "The CPU was throttled because the cgroup hierarchy is shared" — you cannot rewrite the kernel. The fix is dedicated cores or isolation.
- "The bug is in the upstream open-source library" — you cannot rewrite the library. The fix is upgrade, workaround, or vendor support.
The "5 whys" is a guideline; not every symptom has a fixable level-five cause. Stop when the cause becomes a fact of life and the right fix is at a higher level.
When recurrence is okay
Some recurrence is actually fine:
- Pod restarts due to expected lifecycle: Kubernetes is designed for pods to restart. A pod OOMKilled once a quarter under unusual load might be acceptable.
- Spot interruptions: training pods getting interrupted by spot reclamation is by design; recovery should be smooth.
- Brief blips during scale-up: an SLO that allows occasional latency spikes is fine; not every spike needs root-causing.
The discipline is not "never restart anything." It is "every restart should be expected, not surprising." Surprising restarts are incidents; expected ones are routine.
A real cycle of symptom vs root cause
A team I worked with had a recurring 5xx incident on their payments service:
Month 1
Pager fires. Mitigation: restart the deployment. Resolved in 5 minutes. Post-mortem: "transient issue, marked resolved."
Month 4
Pager fires again. Same restart fixes it. Post-mortem: "second occurrence, no clear pattern; marked transient."
Month 7
Third occurrence. The team starts to investigate. Discovery: the service had a connection pool that occasionally got into a bad state where every connection was poisoned. Restart cleared it. Cause: a specific kind of network blip the pool's reconnect logic could not recover from cleanly.
Month 8
Fix shipped: improved connection pool with explicit reset on detection of bad state. No recurrence.
The cumulative cost of three incidents (each ~30 minutes of customer impact, ~2 hours of engineering) was more than the one day of debugging that found the root cause. The decision to investigate after the second occurrence (instead of after the third or fourth) would have prevented the third.
Building the discipline
The patterns that work in teams:
Mandatory post-mortem for any user-impacting incident
Even small ones. The discipline of "what happened, what was the cause, what is the fix" forces engagement.
Action items have owners and deadlines
No "team will look into it." Specific person; specific deadline; tracked to completion.
Track recurrence
A simple list: "incidents we've seen more than once." Each recurrence triggers escalation: "we need to actually fix this."
Read your own post-mortems
Quarterly: a meeting where the team reads the last quarter's post-mortems and asks "what have we learned?" Without this, post-mortems become write-only documentation.
Reward root-cause work
Recognize engineers who fix root causes, not just engineers who put out fires. The first is more valuable; the culture should reflect that.
A team I helped had a "weekly database alert" that they had been responding to for a year. Every Wednesday at 2 AM, the alert fired; the on-call engineer scaled up the database read replicas; the alert cleared. A year of the same drill.
When I joined, I asked "why does this happen at 2 AM on Wednesdays?" Nobody had asked. The investigation took 30 minutes: a weekly batch ETL job ran every Wednesday at 2 AM, generating a brief 10x read load, and the replicas were undersized for the burst. Fix: pre-scale the replicas at 1:55 AM on Wednesdays via cron. The alert never fired again. The team had been treating the symptom every week for a year because nobody had asked "why Wednesday?" Lesson: pattern recognition (weekly, scheduled) is a strong signal of root cause; trace it.
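The fix itself was two crontab lines on a host with cluster access (assuming the read replicas run as a StatefulSet named db-read-replicas; names, counts, and times are illustrative):
# Pre-scale before the Wednesday 02:00 ETL burst, scale back down afterwards
55 1 * * 3  kubectl scale statefulset/db-read-replicas -n prod-db --replicas=6
30 3 * * 3  kubectl scale statefulset/db-read-replicas -n prod-db --replicas=3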
What good post-mortems look like
The structure that produces useful post-mortems:
# Post-Mortem: [Incident Title]
**Date**: 2026-04-25
**Severity**: SEV-2
**Duration**: 18 minutes
**Author**: [name]
## Summary
[1-3 sentences: what happened, who was impacted, how long]
## Timeline
- 14:23 - First alert fired (SLO breach on payments)
- 14:24 - On-call acknowledged
- 14:27 - Hypothesis: app bug from recent deploy
- 14:32 - Ruled out app bug; switched to node investigation
- 14:35 - Found node-level cause (PLEG unhealthy)
- 14:38 - Mitigation: cordoned bad node; pods rescheduled
- 14:41 - Recovery confirmed
## Root Cause
The PLEG (Pod Lifecycle Event Generator) on node X became unhealthy due to a stuck container that the runtime could not unblock. Cause: a specific kernel oops in the network namespace teardown path; pod was stuck in Terminating state; PLEG poll could not complete.
## What Worked
- Layered debugging found the layer in 9 minutes (faster than typical for our team).
- kubectl debug node made it fast to identify the stuck process.
## What Did Not Work
- Initial 5 minutes were wasted on app-layer hypothesis.
- Runbook for "PLEG unhealthy" did not exist; engineer had to figure it out live.
## Action Items
- [ ] [Platform team / Alice / by 2026-05-15] Add "PLEG unhealthy" to the node debugging runbook
- [ ] [Platform team / Bob / by 2026-05-30] Investigate the kernel oops; file upstream bug if appropriate
- [ ] [SRE team / Carol / by 2026-06-15] Add monitoring alert for PLEG unhealthy specifically
- [ ] [Process / engineering manager / by 2026-05-15] Update on-call training to start with cluster events feed
Specific, owned, dated. These are the follow-ups that actually get done.
Summary
Symptom-fix vs root-cause is the difference between debugging the same incident once and debugging it forever. The discipline:
- Mitigate first (stop user impact).
- Reconstruct the timeline (what happened minute-by-minute).
- Identify the trigger (what entered the system).
- Map cause to effect (the chain from trigger to symptom).
- Categorize the cause (bug / config / capacity / process / architecture).
- Plan and execute the fix (specific, owned, dated).
The patterns to avoid: restart-as-fix without finding why; "we'll be more careful"; declaring "one-off" without ruling out recurrence.
The next lesson is the toolkit: the specific commands and tools you reach for at each layer of investigation.