Kubernetes Debugging for SREs

The Investigation Toolkit

You have the layered framework (lesson 1.1) and the symptom-vs-cause discipline (lesson 1.2). This lesson is the tools: the specific commands you reach for at each layer when you need to inspect what is happening.

By the end, you should know what tool to use for each kind of question — and the muscle memory to type the command without looking it up at 3 AM.

KEY CONCEPT

The toolkit for Kubernetes debugging is small but specific. Five tools cover 90 percent of investigations: kubectl (the apiserver client), crictl (the runtime client), journalctl (node-level logs), tcpdump (packet capture), and ephemeral containers via kubectl debug (live pod inspection). Each tool corresponds to a layer; muscle memory matters because typing speed under pressure determines how fast you find the cause.

The toolkit, by layer

The matchup of tools to layers:

  • Application layer: kubectl logs, kubectl exec, ephemeral containers, application metrics
  • Pod layer: kubectl describe pod, kubectl get events, kubectl logs --previous
  • Node layer: kubectl describe node, kubectl debug node, journalctl, crictl, dmesg
  • Cluster layer: kubectl get componentstatuses, apiserver metrics, etcd metrics, CoreDNS logs
  • Cloud layer: cloud CLI (aws/gcloud/az), cloud status pages, cloud audit logs

A tour of each, with the specific commands and patterns.

kubectl: the apiserver client

The Swiss Army knife. The commands you actually use during an incident:

Get state

# Cluster-wide pod status
kubectl get pods -A -o wide

# A specific namespace
kubectl get pods -n prod-checkout

# Pods on a specific node
kubectl get pods -A --field-selector spec.nodeName=ip-10-0-3-21

# Pods that aren't running cleanly
kubectl get pods -A --field-selector status.phase!=Running

# Watch live (great for tracking a fix in real-time)
kubectl get pods -n prod-checkout --watch

Describe (the most useful command in incidents)

# Pod-level detail
kubectl describe pod -n prod-checkout checkout-api-abc123

# Node-level detail
kubectl describe node ip-10-0-3-21

# Service-level detail (helpful for endpoint debugging)
kubectl describe svc -n prod-checkout checkout-api

# Deployment / rollout history
kubectl describe deployment -n prod-checkout checkout-api
kubectl rollout history deployment -n prod-checkout checkout-api

The Events section at the bottom of describe is gold. Read it. Most pod-level issues surface as a clear event.
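If the output is long, a quick way to jump straight to that section (a shell convenience, not a kubectl feature):

# Print everything from the Events: header to the end of describe output
kubectl describe pod -n prod-checkout checkout-api-abc123 | sed -n '/^Events:/,$p'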

Logs

# Live tail
kubectl logs -n prod-checkout checkout-api-abc123 --tail=200 --follow

# Previous container (if it crashed and restarted)
kubectl logs -n prod-checkout checkout-api-abc123 --previous

# Multi-container pod
kubectl logs -n prod-checkout checkout-api-abc123 -c app

# All replicas of a Deployment
kubectl logs -n prod-checkout -l app=checkout-api --tail=200 --max-log-requests=20

# Since a specific time
kubectl logs -n prod-checkout checkout-api-abc123 --since=15m
kubectl logs -n prod-checkout checkout-api-abc123 --since-time='2026-04-25T14:23:00Z'

The --max-log-requests flag matters for multi-replica tailing: the default is 5 concurrent streams, which caps how many pods you can tail at once.

Events

# Cluster-wide events, most recent first
kubectl get events -A --sort-by=.lastTimestamp | tail -30

# Events for a specific resource
kubectl get events -n prod-checkout --field-selector involvedObject.name=checkout-api-abc123

# Warnings only
kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp | tail -30

--sort-by=.lastTimestamp is essential. The default ordering is not by recency, and recency is exactly what you care about during an incident.

Exec and port-forward

# Shell into a pod
kubectl exec -it -n prod-checkout checkout-api-abc123 -- /bin/sh

# Run a one-shot command
kubectl exec -n prod-checkout checkout-api-abc123 -- nslookup payments.prod.svc.cluster.local

# Port-forward to a pod for direct access
kubectl port-forward -n prod-checkout checkout-api-abc123 8080:8080

# Same to a service
kubectl port-forward -n prod-checkout svc/checkout-api 8080:80

kubectl exec requires the image to have a shell; distroless images do not. For those, ephemeral containers (next section) are the answer.

Patch and edit

# Quick label add
kubectl label pod -n prod-checkout checkout-api-abc123 debug=true

# Edit a Deployment
kubectl edit deployment -n prod-checkout checkout-api

# Patch a specific field
kubectl patch deployment -n prod-checkout checkout-api \
  --type=json -p='[{"op":"replace","path":"/spec/replicas","value":5}]'

Edit and patch are mitigation tools: temporarily change something to recover, fix properly later via Git. Useful in a pinch but should not be the long-term path.
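One habit worth pairing with emergency changes, sketched here using the replicas patch from above: snapshot the live object before and after, so reconciling back into Git later is a diff rather than archaeology.

# Snapshot live state, apply the emergency patch, snapshot again
kubectl get deployment -n prod-checkout checkout-api -o yaml > /tmp/checkout-api.before.yaml
kubectl patch deployment -n prod-checkout checkout-api \
  --type=json -p='[{"op":"replace","path":"/spec/replicas","value":5}]'
kubectl get deployment -n prod-checkout checkout-api -o yaml > /tmp/checkout-api.after.yaml
diff /tmp/checkout-api.before.yaml /tmp/checkout-api.after.yaml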

Configuration tips

A few aliases worth setting up:

# In your shell config
alias k='kubectl'
alias kgp='kubectl get pods'
alias kgpa='kubectl get pods -A'
alias kdesc='kubectl describe'
alias kev='kubectl get events -A --sort-by=.lastTimestamp | tail -30'
alias klog='kubectl logs --tail=200'

# Tab completion for kubectl
source <(kubectl completion bash)  # or zsh

kubectl ctx (kubectx) and kubectl ns (kubens) for context switching are also worth installing. During multi-cluster incidents, switching is constant.
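Typical usage, assuming both are installed as kubectl plugins (the cluster and namespace names are illustrative):

kubectl ctx prod-us-east-1    # switch kubeconfig context
kubectl ns prod-checkout      # set the default namespace for the current context
kubectl ctx -                 # jump back to the previous context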

kubectl debug: ephemeral containers

The newer tool that solves the "I cannot get into this distroless container" problem.

# Add a debug container to a running pod
kubectl debug -it -n prod-checkout checkout-api-abc123 \
  --image=nicolaka/netshoot --target=app

# Without --target, the debug container shares the pod's network but no process namespace
# With --target, it also shares the target container's process namespace (so ps/strace work)

The debug container shares the pod's network and (with --target) process namespaces. From inside, you can:

  • tcpdump on the application's network interfaces.
  • strace the application process.
  • nslookup to test DNS resolution.
  • curl to verify connectivity.
  • All without baking debug tools into the production image.

nicolaka/netshoot is the standard debug image — has tcpdump, dig, curl, traceroute, strace, mtr, etc.

For a debug pod that does not attach to an existing pod, just run a fresh netshoot pod:

kubectl run -it --rm netshoot --image=nicolaka/netshoot --restart=Never -- /bin/bash

This is the "throwaway pod for diagnostics" pattern.

kubectl debug node

For node-level inspection without SSH:

kubectl debug node/ip-10-0-3-21 -it --image=nicolaka/netshoot

This drops you into a privileged pod on the target node with /host mounted and access to the host network namespace. From there:

# Inside the debug pod
chroot /host  # work in the node's filesystem

df -h                     # disk usage
free -m                   # memory
top                       # process activity
journalctl -u kubelet     # kubelet logs
journalctl -u containerd  # runtime logs

kubectl debug node is the modern alternative to SSH: RBAC-controlled (only break-glass users should have it), audited (the apiserver logs the debug pod creation), and available on managed clusters where SSH is not.
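Since node debug pods are ordinary pods, the RBAC question reduces to pod-creation rights; a quick check (the service account name is illustrative):

# Can the break-glass account create the pods that node debugging requires?
kubectl auth can-i create pods --as=system:serviceaccount:ops:break-glass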

crictl: the runtime client

When the kubelet says it cannot do something, you go one layer down to the runtime. crictl talks to containerd / CRI-O directly.

# List all sandboxes (pods at the runtime level)
crictl pods

# List all containers
crictl ps -a

# Inspect a container
crictl inspect <container-id>

# Get runtime-level logs
crictl logs <container-id>

# Run a command
crictl exec -it <container-id> sh

# Pull / list images
crictl pull image:tag
crictl images
crictl rmi <image-id>

crictl is essential when:

  • The kubelet is misbehaving and kubectl exec does not work.
  • A container is in a stuck state that the kubelet cannot resolve.
  • You need to see what the runtime sees vs what the kubelet sees.

You need to be on the node (or in kubectl debug node) for crictl to work — it talks to the local runtime socket.
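If crictl cannot find the runtime on its own, point it at the socket explicitly. The path below is containerd's common default but varies by distribution:

# Explicit runtime endpoint
crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a

# Or persist it so every crictl invocation uses the same socket
cat <<'EOF' > /etc/crictl.yaml
runtime-endpoint: unix:///run/containerd/containerd.sock
EOF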

journalctl: node-level logs

Linux's standard log query tool. On Kubernetes nodes:

# kubelet logs
journalctl -u kubelet --since="10 minutes ago"
journalctl -u kubelet --since="1 hour ago" | grep -i error

# containerd logs
journalctl -u containerd --since="10 minutes ago"

# All system logs since the last boot
journalctl -b

# Kernel logs
journalctl -k
journalctl -k --since="1 hour ago" | grep -i oom

# Tail and follow
journalctl -u kubelet -f

Useful flags:

  • --since and --until for time windows.
  • -u <unit> for a specific systemd service.
  • -k for kernel messages only.
  • -p err for errors and above.
  • -f to follow live.

dmesg is the older equivalent for kernel messages; on modern systems journalctl -k is preferred but dmesg still works.

tcpdump: packet capture

For network debugging, the truth is in the packets. tcpdump captures them.

# Inside the pod (via ephemeral container or netshoot pod)
tcpdump -i eth0 -nn

# Specific port
tcpdump -i eth0 -nn port 53      # DNS
tcpdump -i eth0 -nn port 8080    # HTTP

# Specific host
tcpdump -i eth0 -nn host 10.244.5.6

# Save for later analysis
tcpdump -i eth0 -nn -w /tmp/capture.pcap

# Read a saved capture
tcpdump -r /tmp/capture.pcap -nn

Useful flags:

  • -i <interface> to specify interface (any for all).
  • -nn to disable name resolution (faster output and more honest about IPs).
  • -w <file> to save to a file (.pcap format).
  • -c <count> to stop after N packets.
  • -vvv for verbose output.

Filter examples:

# Connection resets and teardowns (packets with RST or FIN set)
tcpdump -i eth0 -nn 'tcp[tcpflags] & (tcp-rst|tcp-fin) != 0'

# DNS traffic with payload printed (eyeball slow or failing lookups)
tcpdump -i eth0 -nn port 53 -A

For most pod-to-pod debugging, run tcpdump in an ephemeral container on the source pod, then on the destination pod, and verify packets flow as expected.
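A sketch of that two-sided capture, reusing the illustrative names and IP from earlier; run each side in its own terminal:

# Terminal 1: source pod -- do requests actually leave?
kubectl debug -it -n prod-checkout checkout-api-abc123 \
  --image=nicolaka/netshoot --target=app -- tcpdump -i eth0 -nn host 10.244.5.6

# Terminal 2: destination pod (name is illustrative) -- do they arrive?
kubectl debug -it -n prod-checkout payments-api-def456 \
  --image=nicolaka/netshoot --target=app -- tcpdump -i eth0 -nn port 8080

# Packets leaving the source but never arriving point to the network in between.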

kubectl get with custom-columns and JSON queries

Useful for advanced queries:

# Custom columns
kubectl get pods -A -o custom-columns="NAME:.metadata.name,NAMESPACE:.metadata.namespace,NODE:.spec.nodeName,STATUS:.status.phase"

# JSON path
kubectl get pods -n prod-checkout -o jsonpath='{.items[*].spec.nodeName}'

# Pipe through jq
kubectl get pods -A -o json | jq '.items[] | select(.status.phase != "Running") | .metadata.name'

# Pods that declare a memory request (a starting point; comparing against live usage needs metrics data)
kubectl get pods -A -o json | jq '.items[] | select(.spec.containers[].resources.requests.memory != null)'

These are slower to type under pressure, but invaluable for post-incident analysis.
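One more jq pattern in the same post-incident spirit: rank pods by restart count.

# Pods ranked by total container restarts, highest first
kubectl get pods -A -o json | jq -r '.items[]
  | [.metadata.namespace, .metadata.name,
     ([.status.containerStatuses[]?.restartCount] | add // 0)]
  | @tsv' | sort -t$'\t' -k3 -rn | head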

kubectl tools worth installing

Beyond core kubectl, a few plugins:

  • kubectx / kubens: fast context and namespace switching.
  • kubectl-tree: shows ownerReference trees (Deployment to ReplicaSet to Pod).
  • kubectl-tail: better log tailing across multiple pods.
  • stern: multi-pod log tailing (similar to kubectl-tail).
  • kubectl-cnpg: for CloudNativePG-managed Postgres clusters.
  • k9s: terminal UI for cluster navigation. Great for "what's running where."
  • lens: GUI alternative; useful for visual pattern recognition.

Install with krew:

kubectl krew install ctx ns tree

For incident-time navigation, k9s plus tab completion is hard to beat. Less typing per command means more time investigating.

Cloud CLI tools

For cloud-layer issues:

AWS

# Status of an EC2 instance
aws ec2 describe-instances --instance-ids i-abcdef1234

# Spot instance request status (interruption notices arrive via instance metadata or EventBridge)
aws ec2 describe-spot-instance-requests

# Service health
aws health describe-events --filter eventTypeCategories=issue

# IAM role used by the kubelet
aws sts get-caller-identity --profile <kubelet-profile>

GCP

# Compute instance status
gcloud compute instances describe <instance> --zone=us-central1-a

# Health
gcloud beta logging read 'resource.type="k8s_node"' --limit=20

# IAM
gcloud auth list

Azure

# VM status
az vm show --resource-group <rg> --name <vm-name>

# AKS cluster status
az aks show --resource-group <rg> --name <cluster>

# Activity log
az monitor activity-log list --max-events 20

The cloud CLI is essential when the cause is below Kubernetes — IAM, networking, quotas, instance-level issues.
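For example, quota exhaustion silently blocking node scale-up is a classic below-Kubernetes cause. A sketch for AWS (the quota code shown was the on-demand standard instances vCPU quota at the time of writing; verify it for your account):

# Current EC2 on-demand vCPU quota
aws service-quotas get-service-quota \
  --service-code ec2 --quota-code L-1216C47A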

Metrics and observability

Beyond commands, the dashboards and metrics that matter:

apiserver metrics

# From anywhere with kubectl access:
kubectl get --raw /metrics | grep apiserver_request_duration_seconds
kubectl get --raw /metrics | grep apiserver_admission_step

Or directly via Prometheus:

# Apiserver request latency p99
histogram_quantile(0.99, rate(apiserver_request_duration_seconds_bucket[5m]))

# Apiserver throttling (priority and fairness)
sum(apiserver_flowcontrol_request_concurrency_in_use) by (flow_schema)

# Failed admission webhook calls
sum(rate(apiserver_admission_webhook_request_total{rejected="true"}[5m])) by (name)

CoreDNS metrics

# Query duration p99
histogram_quantile(0.99, rate(coredns_dns_request_duration_seconds_bucket[5m]))

# Cache hit ratio
sum(rate(coredns_cache_hits_total[5m])) /
  (sum(rate(coredns_cache_hits_total[5m])) + sum(rate(coredns_cache_misses_total[5m])))

# NXDOMAIN responses
sum(rate(coredns_dns_responses_total{rcode="NXDOMAIN"}[5m]))

Node metrics

# Node CPU pressure
rate(node_cpu_seconds_total{mode="iowait"}[5m])

# Node memory pressure
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Disk pressure
1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})

These are baseline queries; build dashboards from them and refer to them during incidents.
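One more baseline worth a panel, assuming kube-state-metrics is installed: container restart rate, which often moves before anything pages.

# Container restarts per namespace/pod over the last 15 minutes
sum(rate(kube_pod_container_status_restarts_total[15m])) by (namespace, pod) > 0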

A typical incident's tool sequence

Putting it all together, an example sequence for a pod-level issue:

# 1. Orient: what's broken cluster-wide?
kubectl get events -A --sort-by=.lastTimestamp | tail -20

# 2. Specific failing workload
kubectl get pods -n prod-checkout

# 3. Why is it failing?
kubectl describe pod -n prod-checkout checkout-api-abc123

# 4. What did the previous run say?
kubectl logs -n prod-checkout checkout-api-abc123 --previous

# 5. Live check from inside the pod
kubectl debug -it -n prod-checkout checkout-api-abc123 --image=nicolaka/netshoot --target=app
nslookup database.prod.svc.cluster.local
curl -v http://database.prod.svc.cluster.local:5432   # a completed connect proves TCP reachability (Postgres won't speak HTTP)

# 6. If pod is fine but the node is suspect
kubectl describe node ip-10-0-3-21

# 7. Node-level if needed
kubectl debug node/ip-10-0-3-21 -it --image=nicolaka/netshoot
chroot /host
journalctl -u kubelet --since="15 minutes ago"

Five to seven commands; under five minutes once you know them by heart.

Building muscle memory

The toolkit only matters if you can use it under pressure. The patterns:

Practice on non-incident days

When the cluster is calm, run the diagnostic commands. Familiarity with the output (what does a healthy kubectl describe node look like?) is necessary for spotting abnormalities.
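A calm-day drill might look like this; five minutes, adjust names to your cluster (kubectl top assumes metrics-server is installed):

# Read healthy output now so unhealthy output stands out later
kubectl get events -A --sort-by=.lastTimestamp | tail -20
kubectl describe node "$(kubectl get nodes -o name | head -1 | cut -d/ -f2)"
kubectl top nodes
kubectl top pods -A --sort-by=memory | head -20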

Make a personal cheat sheet

A short document with the 10-15 commands you find yourself looking up. Tape it to your monitor or paste it in a Slack channel. No shame; even senior engineers reference it during incidents.

Run game days

Module 11 of Production Kubernetes Operations covered this; the gist: scheduled chaos exercises that force the team to use the tools. Builds muscle memory in a controlled setting so the tools are ready when the real incident hits.

Update the kubectl alias

Every time you find yourself typing the same long flag combination, alias it. Speed matters; reduce typing.
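For example, the multi-replica tail from earlier in this lesson is a natural candidate (the alias name is my own):

# Tail all replicas matching a label selector
alias ktail='kubectl logs --tail=200 --follow --max-log-requests=20 -l'
# usage: ktail app=checkout-api -n prod-checkout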

WAR STORY

A team I joined had a senior engineer who could diagnose any Kubernetes issue in under 10 minutes. Watching him work, I noticed: he typed almost nothing. He had aliases, kubectx, k9s, and a personal cheat sheet open in a side window. His "investigation" looked like: type three letters, look at the screen, type three more letters. Watching me work was a different story: typing full kubectl commands, looking up flags, switching contexts manually. The diagnostic skill was the same; his typing was 3x faster. Lesson: the toolkit you can use without thinking is faster than the toolkit you have to look up. Build muscle memory before you need it.

Summary

The Kubernetes investigation toolkit is small but specific:

  • kubectl: get, describe, logs, events, exec — the workhorses.
  • kubectl debug: ephemeral containers and node-level debug pods.
  • crictl: runtime-level inspection on a node.
  • journalctl: node-level logs (kubelet, containerd, kernel).
  • tcpdump: packet-level network debugging.
  • Cloud CLI: when the cause is below Kubernetes.
  • Prometheus / metrics: trends and anomalies.

The skill is not memorizing every flag. It is building muscle memory so the right command takes seconds to type, not minutes. Practice on calm days; build aliases; install k9s and kubectx; keep a cheat sheet.

Module 1 closes here. Module 2 starts the deep dive on specific debugging scenarios — the three flavors of pod failure and how to diagnose each.