The Investigation Toolkit
You have the layered framework (lesson 1.1) and the symptom-vs-cause discipline (lesson 1.2). This lesson is the tools: the specific commands you reach for at each layer when you need to inspect what is happening.
By the end, you should know which tool answers which kind of question — and have the muscle memory to type the commands without looking them up at 3 AM.
The toolkit for Kubernetes debugging is small but specific. Five tools cover 90 percent of investigations: kubectl (the apiserver client), crictl (the runtime client), journalctl (node-level logs), tcpdump (packet capture), and ephemeral containers via kubectl debug (live pod inspection). Each tool corresponds to a layer; muscle memory matters because typing speed under pressure determines how fast you find the cause.
The toolkit, by layer
The matchup of tools to layers:
- kubectl — the apiserver and everything it tracks (workloads, events, logs).
- kubectl debug — live inspection of pods and nodes (ephemeral containers, node debug pods).
- crictl — the container runtime on a node.
- journalctl — node-level services (kubelet, containerd) and the kernel.
- tcpdump — the network, packet by packet.
- Cloud CLI — the infrastructure below Kubernetes.
A tour of each, with the specific commands and patterns.
kubectl: the apiserver client
The Swiss Army knife. The commands you actually use during an incident:
Get state
# Cluster-wide pod status
kubectl get pods -A -o wide
# A specific namespace
kubectl get pods -n prod-checkout
# Pods on a specific node
kubectl get pods -A --field-selector spec.nodeName=ip-10-0-3-21
# Pods that aren't running cleanly
kubectl get pods -A --field-selector status.phase!=Running
# Watch live (great for tracking a fix in real-time)
kubectl get pods -n prod-checkout --watch
Describe (the most useful command in incidents)
# Pod-level detail
kubectl describe pod -n prod-checkout checkout-api-abc123
# Node-level detail
kubectl describe node ip-10-0-3-21
# Service-level detail (helpful for endpoint debugging)
kubectl describe svc -n prod-checkout checkout-api
# Deployment / rollout history
kubectl describe deployment -n prod-checkout checkout-api
kubectl rollout history deployment -n prod-checkout checkout-api
The Events section at the bottom of describe is gold. Read it. Most pod-level issues surface as a clear event.
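When describe output runs long, you can jump straight to that section; a small sed pipe (pod name as in the examples above) prints everything from the Events: header onward:

```shell
# Print only the Events section of a long describe
kubectl describe pod -n prod-checkout checkout-api-abc123 | sed -n '/^Events:/,$p'
```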
Logs
# Live tail
kubectl logs -n prod-checkout checkout-api-abc123 --tail=200 --follow
# Previous container (if it crashed and restarted)
kubectl logs -n prod-checkout checkout-api-abc123 --previous
# Multi-container pod
kubectl logs -n prod-checkout checkout-api-abc123 -c app
# All replicas of a Deployment
kubectl logs -n prod-checkout -l app=checkout-api --tail=200 --max-log-requests=20
# Since a specific time
kubectl logs -n prod-checkout checkout-api-abc123 --since=15m
kubectl logs -n prod-checkout checkout-api-abc123 --since-time='2026-04-25T14:23:00Z'
The --max-log-requests flag matters for multi-replica tailing: the default is 5 concurrent streams, which caps how many pods you can follow at once.
Events
# Cluster-wide events, most recent first
kubectl get events -A --sort-by=.lastTimestamp | tail -30
# Events for a specific resource
kubectl get events -n prod-checkout --field-selector involvedObject.name=checkout-api-abc123
# Warnings only
kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp | tail -30
--sort-by=.lastTimestamp is essential. The default sort is creation time, which is not what you want during an incident.
Exec and port-forward
# Shell into a pod
kubectl exec -it -n prod-checkout checkout-api-abc123 -- /bin/sh
# Run a one-shot command
kubectl exec -n prod-checkout checkout-api-abc123 -- nslookup payments.prod.svc.cluster.local
# Port-forward to a pod for direct access
kubectl port-forward -n prod-checkout checkout-api-abc123 8080:8080
# Same to a service
kubectl port-forward -n prod-checkout svc/checkout-api 8080:80
kubectl exec requires the image to have a shell; distroless images do not. For those, ephemeral containers (next section) are the answer.
Patch and edit
# Quick label add
kubectl label pod -n prod-checkout checkout-api-abc123 debug=true
# Edit a Deployment
kubectl edit deployment -n prod-checkout checkout-api
# Patch a specific field
kubectl patch deployment -n prod-checkout checkout-api \
--type=json -p='[{"op":"replace","path":"/spec/replicas","value":5}]'
Edit and patch are mitigation tools: temporarily change something to recover, fix properly later via Git. Useful in a pinch but should not be the long-term path.
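Two more mitigation moves worth having at your fingertips alongside edit and patch (deployment name as in the examples above; these commands are standard kubectl, shown here as a sketch):

```shell
# Roll back to the previous ReplicaSet revision
kubectl rollout undo deployment -n prod-checkout checkout-api
# Scale a crash-looping workload to zero while you investigate
kubectl scale deployment -n prod-checkout checkout-api --replicas=0
# Both are temporary: record what you changed and fix it in Git afterwards
```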
Configuration tips
A few aliases worth setting up:
# In your shell config
alias k='kubectl'
alias kgp='kubectl get pods'
alias kgpa='kubectl get pods -A'
alias kdesc='kubectl describe'
alias kev='kubectl get events -A --sort-by=.lastTimestamp | tail -30'
alias klog='kubectl logs --tail=200'
# Tab completion for kubectl
source <(kubectl completion bash) # or zsh
kubectl ctx (kubectx) and kubectl ns (kubens) for context switching are also worth installing. During multi-cluster incidents, switching is constant.
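Once installed (for example via krew), switching is a couple of keystrokes; the context name below is hypothetical:

```shell
kubectl ctx prod-us-east-1   # switch cluster context
kubectl ns prod-checkout     # switch default namespace for this context
kubectl ctx -                # toggle back to the previous context
```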
kubectl debug: ephemeral containers
The newer tool that solves the "I cannot get into this distroless container" problem.
# Add a debug container to a running pod
kubectl debug -it -n prod-checkout checkout-api-abc123 \
--image=nicolaka/netshoot --target=app
# Without --target, the debug container joins the pod's network namespace only
# With --target, it also shares the target container's process namespace (so ps/strace work)
The debug container shares the pod's network and (with --target) process namespaces. From inside, you can:
- tcpdump on the application's network interfaces.
- strace the application process.
- nslookup to test DNS resolution.
- curl to verify connectivity.
- All without baking debug tools into the production image.
nicolaka/netshoot is the de facto standard debug image — it ships with tcpdump, dig, curl, traceroute, strace, mtr, and more.
For a debug pod that does not attach to an existing pod, just run a fresh netshoot pod:
kubectl run -it --rm netshoot --image=nicolaka/netshoot --restart=Never -- /bin/bash
This is the "throwaway pod for diagnostics" pattern.
kubectl debug node
For node-level inspection without SSH:
kubectl debug node/ip-10-0-3-21 -it --image=nicolaka/netshoot
This drops you into a privileged pod on the target node with /host mounted and access to the host network namespace. From there:
# Inside the debug pod
chroot /host # work in the node's filesystem
df -h # disk usage
free -m # memory
top # process activity
journalctl -u kubelet # kubelet logs
journalctl -u containerd # runtime logs
kubectl debug node is the modern alternative to SSH: it is RBAC-controlled (only break-glass users should have it), audited (the apiserver logs the debug pod creation), and it works on managed clusters where node SSH is not possible.
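Whether you hold the needed permissions is checkable up front with kubectl auth can-i, rather than discovered mid-incident; namespace here matches the earlier examples:

```shell
# Can this identity create (debug) pods?
kubectl auth can-i create pods --all-namespaces
# Ephemeral containers are a pod subresource; debugging a pod needs update on it
kubectl auth can-i update pods --subresource=ephemeralcontainers -n prod-checkout
```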
crictl: the runtime client
When the kubelet says it cannot do something, you go one layer down to the runtime. crictl talks to containerd / CRI-O directly.
# List all sandboxes (pods at the runtime level)
crictl pods
# List all containers
crictl ps -a
# Inspect a container
crictl inspect <container-id>
# Get runtime-level logs
crictl logs <container-id>
# Run a command
crictl exec -it <container-id> sh
# Pull / list images
crictl pull image:tag
crictl images
crictl rmi <image-id>
crictl is essential when:
- The kubelet is misbehaving and kubectl exec does not work.
- A container is stuck in a state the kubelet cannot resolve.
- You need to compare what the runtime sees with what the kubelet sees.
You need to be on the node (or in kubectl debug node) for crictl to work — it talks to the local runtime socket.
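If crictl cannot find the runtime, name the socket explicitly or persist it; the paths below assume containerd's default socket location:

```shell
# One-off: pass the endpoint on the command line
crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a
# Persistent: write /etc/crictl.yaml once on the node
cat <<'EOF' | sudo tee /etc/crictl.yaml
runtime-endpoint: unix:///run/containerd/containerd.sock
image-endpoint: unix:///run/containerd/containerd.sock
EOF
```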
journalctl: node-level logs
systemd's standard log query tool, present on virtually every modern Linux node. On Kubernetes nodes:
# kubelet logs
journalctl -u kubelet --since="10 minutes ago"
journalctl -u kubelet --since="1 hour ago" | grep -i error
# containerd logs
journalctl -u containerd --since="10 minutes ago"
# All system logs since the last boot
journalctl -b
# Kernel logs
journalctl -k
journalctl -k --since="1 hour ago" | grep -i oom
# Tail and follow
journalctl -u kubelet -f
Useful flags:
- --since and --until for time windows.
- -u <unit> for a specific systemd service.
- -k for kernel messages only.
- -p err for errors and above.
- -f to follow live.
dmesg is the older equivalent for kernel messages; on modern systems journalctl -k is preferred but dmesg still works.
tcpdump: packet capture
For network debugging, the truth is in the packets. tcpdump captures them.
# Inside the pod (via ephemeral container or netshoot pod)
tcpdump -i eth0 -nn
# Specific port
tcpdump -i eth0 -nn port 53 # DNS
tcpdump -i eth0 -nn port 8080 # HTTP
# Specific host
tcpdump -i eth0 -nn host 10.244.5.6
# Save for later analysis
tcpdump -i eth0 -nn -w /tmp/capture.pcap
# Read a saved capture
tcpdump -r /tmp/capture.pcap -nn
Useful flags:
- -i <interface> to pick the capture interface (any for all).
- -nn to disable name and port resolution (faster output and more honest about IPs).
- -w <file> to save a capture (.pcap format).
- -c <count> to stop after N packets.
- -vvv for verbose output.
Filter examples:
# Connection resets (RSTs usually mean refused or forcibly closed connections)
tcpdump -i eth0 -nn 'tcp[tcpflags] & tcp-rst != 0'
# DNS traffic with ASCII payload (note: flags go before the filter expression)
tcpdump -i eth0 -nn -A port 53
For most pod-to-pod debugging, run tcpdump in an ephemeral container on the source pod, then on the destination pod, and verify packets flow as expected.
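A sketch of that two-point capture (the destination IP is hypothetical); the interesting case is SYNs visible on the source but never arriving at the destination, which points at a NetworkPolicy or CNI drop in between:

```shell
# Terminal 1: capture on the source pod via an ephemeral container
kubectl debug -it -n prod-checkout checkout-api-abc123 \
  --image=nicolaka/netshoot --target=app
tcpdump -i eth0 -nn host 10.244.5.6 and tcp port 8080
# Terminal 2: the same filter from the destination pod, using the source pod's IP
# SYNs on both sides: connectivity is fine; SYNs only on the source: dropped en route
```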
kubectl get with custom-columns and JSON queries
Useful for advanced queries:
# Custom columns
kubectl get pods -A -o custom-columns="NAME:.metadata.name,NAMESPACE:.metadata.namespace,NODE:.spec.nodeName,STATUS:.status.phase"
# JSON path
kubectl get pods -n prod-checkout -o jsonpath='{.items[*].spec.nodeName}'
# Pipe through jq
kubectl get pods -A -o json | jq '.items[] | select(.status.phase != "Running") | .metadata.name'
# Pods that define a memory request (a starting point for right-sizing checks)
kubectl get pods -A -o json | jq '.items[] | select(.spec.containers[].resources.requests.memory != null) | .metadata.name'
These are slower under pressure because of the typing, but invaluable for post-incident analysis.
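The jq select() pattern above is easy to rehearse offline against a hand-written stand-in for kubectl's JSON shape (the sample below is made up):

```shell
# Minimal stand-in for `kubectl get pods -o json` output
cat > /tmp/pods.json <<'EOF'
{"items":[
 {"metadata":{"name":"checkout-api-abc123"},"status":{"phase":"Running"}},
 {"metadata":{"name":"checkout-api-def456"},"status":{"phase":"Pending"}}
]}
EOF
# Names of pods not in Running
jq -r '.items[] | select(.status.phase != "Running") | .metadata.name' /tmp/pods.json
# → checkout-api-def456
```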
kubectl tools worth installing
Beyond core kubectl, a few plugins:
- kubectx / kubens: fast context and namespace switching.
- kubectl-tree: shows ownerReference trees (Deployment to ReplicaSet to Pod).
- kubectl-tail: better log tailing across multiple pods.
- stern: multi-pod log tailing (similar to kubectl-tail).
- kubectl-cnpg: for CloudNativePG-managed Postgres clusters.
- k9s: terminal UI for cluster navigation. Great for "what's running where."
- lens: GUI alternative; useful for visual pattern recognition.
Install with krew:
kubectl krew install ctx ns tree
For incident-time navigation, k9s plus tab completion is hard to beat. Less typing per command means more time investigating.
Cloud CLI tools
For cloud-layer issues:
AWS
# Status of an EC2 instance
aws ec2 describe-instances --instance-ids i-abcdef1234
# Spot interruption notices
aws ec2 describe-spot-instance-requests
# Service health
aws health describe-events --filter eventTypeCategories=issue
# IAM role used by the kubelet
aws sts get-caller-identity --profile <kubelet-profile>
GCP
# Compute instance status
gcloud compute instances describe <instance> --zone=us-central1-a
# Health
gcloud beta logging read 'resource.type="k8s_node"' --limit=20
# IAM
gcloud auth list
Azure
# VM status
az vm show --resource-group <rg> --name <vm-name>
# AKS cluster status
az aks show --resource-group <rg> --name <cluster>
# Activity log
az monitor activity-log list --max-events 20
The cloud CLI is essential when the cause is below Kubernetes — IAM, networking, quotas, instance-level issues.
Metrics and observability
Beyond commands, the dashboards and metrics that matter:
apiserver metrics
# From a pod with kubectl access:
kubectl get --raw /metrics | grep apiserver_request_duration_seconds
kubectl get --raw /metrics | grep apiserver_admission_step
Or directly via Prometheus:
# Apiserver request latency p99
histogram_quantile(0.99, rate(apiserver_request_duration_seconds_bucket[5m]))
# Apiserver throttling (priority and fairness)
sum(apiserver_flowcontrol_request_concurrency_in_use) by (flowSchema)
# Failed admission webhook calls
sum(rate(apiserver_admission_webhook_request_total{rejected="true"}[5m])) by (name)
CoreDNS metrics
# Query duration p99
histogram_quantile(0.99, rate(coredns_dns_request_duration_seconds_bucket[5m]))
# Cache hit ratio
sum(rate(coredns_cache_hits_total[5m])) /
(sum(rate(coredns_cache_hits_total[5m])) + sum(rate(coredns_cache_misses_total[5m])))
# NXDOMAIN responses
sum(rate(coredns_dns_responses_total{rcode="NXDOMAIN"}[5m]))
Node metrics
# CPU time lost to I/O wait (high iowait usually means disk pressure, not CPU)
rate(node_cpu_seconds_total{mode="iowait"}[5m])
# Node memory pressure
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
# Disk pressure
1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})
These are baseline queries; build dashboards from them and refer to them during incidents.
A typical incident's tool sequence
Putting it all together, an example sequence for a pod-level issue:
# 1. Orient: what's broken cluster-wide?
kubectl get events -A --sort-by=.lastTimestamp | tail -20
# 2. Specific failing workload
kubectl get pods -n prod-checkout
# 3. Why is it failing?
kubectl describe pod -n prod-checkout checkout-api-abc123
# 4. What did the previous run say?
kubectl logs -n prod-checkout checkout-api-abc123 --previous
# 5. Live check from inside the pod
kubectl debug -it -n prod-checkout checkout-api-abc123 --image=nicolaka/netshoot --target=app
nslookup database.prod.svc.cluster.local
curl -v http://database.prod.svc.cluster.local:5432  # TCP connect check; Postgres won't speak HTTP, but a successful connect proves reachability
# 6. If pod is fine but the node is suspect
kubectl describe node ip-10-0-3-21
# 7. Node-level if needed
kubectl debug node/ip-10-0-3-21 -it --image=nicolaka/netshoot
chroot /host
journalctl -u kubelet --since="15 minutes ago"
Five to seven commands; under five minutes once you know them by heart.
Building muscle memory
The toolkit only matters if you can use it under pressure. The patterns:
Practice on non-incident days
When the cluster is calm, run the diagnostic commands anyway. Knowing what healthy output looks like (what does a clean kubectl describe node look like?) is what lets you spot the abnormal.
Make a personal cheat sheet
A short document with the 10-15 commands you find yourself looking up. Tape it to your monitor or paste it in a Slack channel. No shame; even senior engineers reference it during incidents.
Run game days
Module 11 of Production Kubernetes Operations covered this; the gist: scheduled chaos exercises that force the team to use the tools. Builds muscle memory in a controlled setting so the tools are ready when the real incident hits.
Update the kubectl alias
Every time you find yourself typing the same long flag combination, alias it. Speed matters; reduce typing.
A team I joined had a senior engineer who could diagnose any Kubernetes issue in under 10 minutes. Watching him work, I noticed he typed almost nothing: aliases, kubectx, k9s, and a personal cheat sheet open in a side window. His investigation looked like: type three letters, look at the screen, type three more letters. I worked at a different speed — typing full kubectl commands, looking up flags, switching contexts manually. The diagnostic skill was the same; his typing overhead was a third of mine. The lesson: the toolkit you can use without thinking is faster than the toolkit you have to look up. Build muscle memory before you need it.
Summary
The Kubernetes investigation toolkit is small but specific:
- kubectl: get, describe, logs, events, exec — the workhorses.
- kubectl debug: ephemeral containers and node-level debug pods.
- crictl: runtime-level inspection on a node.
- journalctl: node-level logs (kubelet, containerd, kernel).
- tcpdump: packet-level network debugging.
- Cloud CLI: when the cause is below Kubernetes.
- Prometheus / metrics: trends and anomalies.
The skill is not memorizing every flag. It is building muscle memory so the right command takes seconds to type, not minutes. Practice on calm days; build aliases; install k9s and kubectx; keep a cheat sheet.
Module 1 closes here. Module 2 starts the deep dive on specific debugging scenarios — the three flavors of pod failure and how to diagnose each.