Your Critical Pod Got OOMKilled. The Pod That Caused It Is Still Running. Here Is Why.
A node runs out of memory. The kernel and the kubelet both pick which pod to kill. Neither of them picks the leaky one. They pick the well-behaved BestEffort pod next door. The QoS, oom_score_adj, and eviction-priority story most engineers never learn.
A pager goes off. payments-api is down. You check the events:
Warning Killing 6s kubelet Memory cgroup out of memory:
Killed process 12345 (payments-api),
total-vm:512000kB, anon-rss:280000kB,
oom_score_adj:1000
You check the actual memory usage of payments-api. It was using 280MB on a node with 16GB of memory. It was nowhere near being the problem. So why was it killed?
You look at what else was running on the node and find it: data-pipeline-worker. It is running fine, currently at 7.5GB out of an 8GB limit. The node has 16GB total; pods on it request 12GB in aggregate. The node ran out of memory because data-pipeline-worker's working set ballooned, the kernel needed to free pages, and so it killed... payments-api.
The misbehaving pod is still running. The well-behaved one died. The on-call gets to explain this to the customer team.
This is the OOM-kills-the-wrong-pod problem, and it is very common. Two different systems decide which pod to kill (the kernel and the kubelet), they use different criteria, and neither of them is "the pod that caused the problem." This post is the actual algorithm both of them use, why "well-behaved BestEffort" is the worst possible status, and how to set things up so the right pod dies when memory runs out.
Two killers, not one#
There are two completely different mechanisms that kill pods on memory pressure. Understanding which one fired is the difference between debugging the right thing and the wrong thing.
1. Kernel-level cgroup OOM killer. When a pod exceeds its own limits.memory, the cgroup OOM killer fires. It kills processes inside that pod (typically the main one) until the cgroup is back under its limit. This kill is local to one pod; other pods are unaffected.
2. Kubelet-level eviction. When the node itself is under memory pressure (low free memory regardless of any single pod's limit), the kubelet picks a pod and evicts it. The kubelet uses different criteria than the kernel.
The motivating scenario at the top is node-level pressure, but look at the event again: that is a kernel OOM kill, not a kubelet eviction. Memory ran out faster than the kubelet could react. The kernel did not pick payments-api because of its memory usage; it picked it because of the oom_score_adj the kubelet had stamped on it at startup, derived from its QoS class. Both mechanisms show up in incidents like this, so you need to understand both.
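A quick way to tell which one fired, if you have node access (a rough check; exact log wording varies by kernel version):
# On the node, or from a privileged debug pod with access to the host's kernel log
dmesg -T | grep -iE "out of memory|oom-kill" | tail -20
# "Memory cgroup out of memory: Killed process ..."  -> a cgroup hit its limit (kernel, variant 1)
# "Out of memory: Killed process ..." with no cgroup -> the whole node ran out (kernel, node-level)
# Neither line, but the pod shows Reason: Evicted    -> the kubelet evicted it (variant 2)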
How the kernel picks (cgroup-level OOM)#
Inside a single pod's cgroup, when the cgroup hits its memory limit and cannot reclaim pages, the kernel picks a process to kill. The selection uses oom_score, which is approximately:
oom_score ≈ (process memory / total memory of the OOM domain) * 1000
            + oom_score_adj
The OOM domain is the cgroup's limit for a cgroup-level kill and the node's memory for a node-level kill. Higher oom_score = more likely to be killed. The kernel iterates over candidates and picks the highest. Note the scale: actual usage contributes at most roughly 1000 points, while oom_score_adj spans -1000 to 1000, so the adjustment can dominate the decision.
For a pod with one main container, this is straightforward: the main container's main process is almost always picked. For multi-container pods, the container using the most memory inside the cgroup is picked. This part is rarely surprising.
The interesting part is the oom_score_adj value, which is set by the kubelet at pod start based on the pod's QoS class. This value also factors into node-level kernel OOM (when the node OOM killer runs across all processes on the host), so understanding it matters for both.
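You can see both numbers on a live node. A small sketch, assuming node access; payments-api is a placeholder for whatever process you care about:
# Run on the node: print pid, current oom_score, and the kubelet-assigned oom_score_adj
for pid in $(pgrep -f payments-api); do
  printf "pid=%-8s score=%-6s adj=%s\n" "$pid" \
    "$(cat /proc/$pid/oom_score)" \
    "$(cat /proc/$pid/oom_score_adj)"
done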
QoS classes: how Kubernetes labels pods#
Every pod gets a Quality of Service class assigned automatically based on its resource configuration:
Guaranteed: every container has both requests and limits set, and requests == limits for both CPU and memory.
resources:
requests:
cpu: "1"
memory: "1Gi"
limits:
cpu: "1" # equal to request
memory: "1Gi" # equal to request
Burstable: at least one container has a CPU or memory request or limit set, but the pod does not meet the Guaranteed criteria. Most "normal" pods land here. Examples: only a memory request set, or memory request != limit.
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "1" # not equal to request
memory: "512Mi" # not equal to request
BestEffort: no requests or limits on any container. Looks innocent. Is dangerous.
resources: {} # nothing set
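You do not have to work the rules out by hand. The API server records the computed class on the pod status:
# Which QoS class did this pod actually land in?
kubectl get pod $POD -o jsonpath='{.status.qosClass}'
# -> Guaranteed | Burstable | BestEffort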
How QoS becomes oom_score_adj#
The kubelet maps QoS class to a Linux oom_score_adj (the per-process adjustment):
| QoS class | oom_score_adj |
|---|---|
| BestEffort | 1000 (highest priority for killing) |
| Burstable | 1000 - (1000 * memory_request / node_capacity), clamped to [2, 999] |
| Guaranteed | -997 (almost never killed) |
Read this carefully. A BestEffort pod is killed before any other pod, regardless of which pod is actually misbehaving. The kernel sees oom_score_adj=1000 on every process in that pod and picks them first.
The Burstable formula is the subtle one. A Burstable pod with a tiny memory request (say 64Mi on a node with 64Gi capacity) gets:
oom_score_adj = 1000 - (1000 * 64Mi / 64Gi) = 1000 - 1 ≈ 999 (and the clamp caps it at 999 anyway)
That is essentially as bad as BestEffort. A Burstable pod with a large memory request (say 8Gi on a 64Gi node) gets:
oom_score_adj = 1000 - (1000 * 8Gi / 64Gi) = 1000 - 125 = 875
Lower, but still high. Guaranteed pods are at -997, which makes them effectively invulnerable to OOM kill compared to anything else.
This is where the wrong pod gets picked. The misbehaving Burstable pod with an 8Gi request has oom_score_adj = 875. The well-behaved BestEffort pod next door has oom_score_adj = 1000. When memory runs out, the kernel adds each process's scaled usage to these adjustments and kills the highest total. Unless the misbehaving pod is using a very large fraction of the node, the BestEffort pod's 1000 still wins. The well-behaved pod dies. The misbehaving one keeps running.
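If you want to sanity-check a pod's likely oom_score_adj before deploying, the Burstable mapping is easy to reproduce. A minimal shell sketch of the formula above (values in MiB; the request and capacity here are examples):
# oom_score_adj = 1000 - 1000 * request / capacity, clamped to [2, 999]
request_mib=8192      # 8Gi request
capacity_mib=65536    # 64Gi node
adj=$(( 1000 - (1000 * request_mib) / capacity_mib ))
(( adj < 2 )) && adj=2
(( adj > 999 )) && adj=999
echo "oom_score_adj=$adj"   # -> 875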
How the kubelet picks (node-level eviction)#
The kubelet's eviction logic is designed to fire before the node hits hard memory pressure and the kernel OOM killer has to step in. It runs against configured thresholds:
--eviction-hard=memory.available<100Mi,nodefs.available<10%
--eviction-soft=memory.available<200Mi
--eviction-soft-grace-period=memory.available=2m
When a soft threshold is crossed for its grace period, or a hard threshold is crossed at all, the kubelet picks pods to evict. The ranking, in order:
- Whether the pod's usage exceeds its requests. Pods over their requests are evicted before pods under them.
- PriorityClass. Within the same group, lower priority goes first.
- Usage minus request. Within the same group and priority, the biggest overshoot goes first.
So the kubelet is somewhat smarter than the kernel: it does try to find the misbehaving pod first. But several scenarios still produce surprise outcomes:
- BestEffort pods (no request) exceed their request by definition: the request is zero, so any usage exceeds it. They land in the first eviction group no matter how little memory they use, ahead of every pod that is still under its request, even one using far more in absolute terms.
- A Burstable pod running at 95% of its request is never in that first group, even though on an overcommitted node it contributes just as much to the pressure.
- PriorityClass affects only the kubelet eviction, not the kernel OOM killer. A high-priority Guaranteed pod is safe from both. A high-priority BestEffort pod is safe from kubelet but not from the kernel.
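To see which thresholds a particular node's kubelet is actually running with, you can read its live config through the API server proxy (a sketch; assumes the configz endpoint is reachable and jq is installed):
NODE=worker-1   # placeholder node name
kubectl get --raw "/api/v1/nodes/${NODE}/proxy/configz" \
  | jq '.kubeletconfig | {evictionHard, evictionSoft, evictionSoftGracePeriod}'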
The combined effect: how the wrong pod gets killed#
A common production scenario:
- Node has 16Gi capacity.
- Pod A: payments-api, BestEffort (no requests/limits set, "we'll figure it out later"). Using 200Mi.
- Pod B: data-pipeline, Burstable with requests.memory: 4Gi, limits.memory: 8Gi. Currently using 7.5Gi (working as designed, but heavy).
- Pod C: monitoring-agent, BestEffort. Using 100Mi.
- System and kube-system: ~1Gi.
Total: ~9Gi used on a 16Gi node. Plenty of headroom, right?
Then a batch of work pushes data-pipeline to its 8Gi limit. Pod usage is now over 9Gi of the 16Gi, and page cache, network buffers, and kernel allocations push memory pressure higher still. memory.available drops below the eviction threshold. The kubelet starts evicting.
What gets evicted? If all three pods have the same priority, the kubelet actually gets this one right: data-pipeline is 4Gi over its request, a far bigger overshoot than either BestEffort pod, so it goes first. But give data-pipeline a higher PriorityClass (batch teams often do exactly that) and the ranking flips: the lower-priority BestEffort pods come first, and Pod A (payments-api), the one using more memory of the two, goes.
Or maybe it is the kernel OOM that fires before kubelet eviction. The kernel sees:
- Pod A: oom_score_adj = 1000, usage = 200Mi
- Pod B: oom_score_adj ≈ 750 (Burstable with 4Gi request on 16Gi node), usage = 8Gi
- Pod C: oom_score_adj = 1000, usage = 100Mi
Score ≈ usage as a fraction of node memory, scaled to 1000, plus the adjustment. On a 16Gi node: Pod A: ~12 + 1000 ≈ 1012. Pod B: 500 + 750 = 1250. Pod C: ~6 + 1000 ≈ 1006.
In this case the math lands on the right pod: data-pipeline is using half the node, its score is highest, and it gets killed. But change the numbers slightly. If Pod B requests 8Gi instead of 4Gi, its adj drops to ≈ 500: Pod B scores 500 + 500 = 1000, while Pod A still scores ~1012. The well-behaved pod is now the top candidate.
The more pernicious scenario: Pod B requests 12Gi (close to node capacity), so adj ≈ 250. Pod B scores 500 + 250 = 750. Pod A, using all of 200Mi, scores ~1012 and loses by a wide margin. The pod using 8Gi survives; the pod using 200Mi dies, purely because of its QoS class.
Numbers vary. The general rule: Burstable pods with large memory requests are partially shielded (high request → low oom_score_adj). BestEffort pods are always near the top of the kill list regardless of how little memory they actually use.
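If you want to play with the numbers yourself, the approximation is simple enough to script. A back-of-the-envelope sketch of the three-pod example above (usage in MiB; not the kernel's exact arithmetic):
capacity_mib=16384   # 16Gi node
score() { echo "$1 score=$(( $2 * 1000 / capacity_mib + $3 ))"; }   # name, usage_mib, oom_score_adj
score payments-api    200  1000   # -> 1012
score data-pipeline  8192   750   # -> 1250
score monitoring      100  1000   # -> 1006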
How to set this up correctly#
Three rules.
Rule 1: never run BestEffort in production. Always set at least memory requests and limits. The minimum:
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
memory: "512Mi" # cap to prevent runaway
This puts the pod in Burstable with an oom_score_adj around 992 on a 16Gi node, which is still high. To lower it further, raise requests.memory closer to the pod's actual working set.
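Finding the offenders is one command. A sketch assuming jq is installed:
# List every BestEffort pod in the cluster
kubectl get pods -A -o json \
  | jq -r '.items[] | select(.status.qosClass == "BestEffort")
           | "\(.metadata.namespace)/\(.metadata.name)"'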
Rule 2: critical pods should be Guaranteed. For pods you absolutely cannot afford to lose:
resources:
requests:
cpu: "500m"
memory: "512Mi"
limits:
cpu: "500m" # equal to request
memory: "512Mi" # equal to request
This makes them Guaranteed (oom_score_adj = -997). The kernel will not kill them unless absolutely no other option exists.
Rule 3: use PriorityClass for scheduling priority and eviction priority. Higher priority pods are evicted last by the kubelet (does not affect kernel OOM directly):
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: critical-payments
value: 100000
description: "Payments services - never evict before lower priority workloads"
---
apiVersion: v1
kind: Pod
spec:
priorityClassName: critical-payments
# ... rest of pod spec, ideally Guaranteed QoS
Combine all three: Guaranteed QoS + high PriorityClass + actual realistic requests = pod survives memory pressure unless the node is genuinely failing.
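Both halves are visible on the pod object, so a critical service is easy to verify after deploy:
# Expect: Guaranteed  critical-payments
kubectl get pod $POD -o jsonpath='{.status.qosClass}{"  "}{.spec.priorityClassName}{"\n"}'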
Don't forget system reservations#
Even with QoS done right, a node can run out of memory if you allow pod requests to use the entire node. Reserve some for the OS and the kubelet:
# kubelet config (KubeletConfiguration)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:
  cpu: 500m
  memory: 1Gi
kubeReserved:
  cpu: 500m
  memory: 1Gi
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "10%"
This means: out of a 16Gi node, only 16Gi - 1Gi (system) - 1Gi (kube) - 200Mi (eviction reserve) ≈ 13.8Gi is available for pods. The scheduler accounts for this; pods can collectively request up to 13.8Gi on this node, not 16Gi.
Without these reservations, pod requests can equal node capacity, and the kubelet has nothing to keep itself running when memory is tight.
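The effect is visible on the node object: allocatable is capacity minus the system, kube, and eviction reserves. A quick check:
kubectl get node $NODE \
  -o custom-columns='CAPACITY:.status.capacity.memory,ALLOCATABLE:.status.allocatable.memory'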
Detecting and diagnosing in real time#
When a pod gets OOMKilled:
# Step 1: classify the kill
kubectl describe pod $POD | grep -A 10 "Last State"
# Look for:
# - Last State, Reason: OOMKilled             -> kernel cgroup OOM
# - pod Status: Failed, Reason: Evicted,
#   Message: "The node was low on resource: memory"   -> kubelet eviction
#   (eviction is reported at the pod level, not under a container's Last State)
# Step 2: check pod events
kubectl get events -n $NAMESPACE --field-selector involvedObject.name=$POD
# Step 3: check oom_score_adj on the running container
# ("main-process" is a placeholder for your main process name; this only works while the container is up)
PID=$(kubectl exec $POD -- pgrep -f main-process | head -1)
kubectl exec $POD -- cat /proc/$PID/oom_score_adj
If you see an oom_score_adj of 1000 (or 999) on a pod that should not be first in line to die, you are looking at a BestEffort pod or a Burstable pod with a tiny request. Add proper requests.
For node-level eviction events:
# Check if the kubelet evicted anything recently
kubectl get events --all-namespaces --field-selector reason=Evicted
# Check node memory pressure status
kubectl describe node $NODE | grep -A 5 "Conditions"
Look for MemoryPressure: True. If it is set, the kubelet was actively evicting. Check what it picked and whether those were pods you could actually afford to lose.
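Then find the pod that actually caused the pressure, which is usually not the one that died (the second command assumes metrics-server is installed):
# Who is on the pressured node, and who is using the most memory cluster-wide
kubectl get pods -A --field-selector spec.nodeName=$NODE -o wide
kubectl top pods -A --sort-by=memory | head -15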
A subtle one: containers within a pod#
The OOM killer operates on processes, not pods. If a pod has multiple containers, the kernel can kill a process in any of them. The kubelet sets oom_score_adj per container from the pod's QoS class (all containers in a pod share the same class), so within a pod the tiebreaker is actual memory usage; a container that exceeds its own memory limit is OOM-killed inside its own cgroup without touching its neighbors.
For a sidecar pattern (e.g., main app + Envoy sidecar), if Envoy starts leaking and has no limit of its own, the main app can be killed first simply because it has the higher absolute memory usage. This is rare, but it happens with badly-configured sidecars. Set sane memory limits on every container in the pod, not just the main one.
Quick reference: the OOM checklist#
1. When a pod is OOMKilled, classify the kill:
- kubectl describe pod -> Last State Reason
OOMKilled = kernel cgroup OOM (pod exceeded its own limit)
Evicted = kubelet eviction (node memory pressure)
2. If kernel cgroup OOM:
- Check pod's actual memory usage vs limit
- Look for memory leak, container heap growing
- Raise limits if working set is genuinely larger than thought
- Add memory profiling if it's a leak
3. If kubelet eviction:
- Check node memory pressure conditions
- Identify the actual heavy-memory pod (might not be the killed one)
- Verify QoS class of killed pod (kubectl describe -> QoS Class)
- BestEffort or Burstable with low request? -> raise requests
- Guaranteed? -> probably system overcommitted, look at scheduling
4. Set up properly:
- No BestEffort in production
- Critical pods = Guaranteed QoS + high PriorityClass
- system-reserved + kube-reserved + eviction-hard configured
- All containers in a pod have memory limits
5. Monitor:
- container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.85
- node_memory_MemAvailable_bytes < eviction-hard threshold
- Alert on any Evicted event in production namespaces
What to actually monitor#
Three metrics worth alerting on:
# 1. Pods near their memory limit (warning before OOM)
#    (match on namespace too; containers without a limit report 0 and are filtered out)
container_memory_working_set_bytes{container!=""}
  / on(namespace, pod, container)
  (container_spec_memory_limit_bytes{container!=""} != 0) > 0.85
# 2. Node memory pressure (proxy for incoming evictions)
kube_node_status_condition{condition="MemoryPressure", status="true"} == 1
# 3. Any eviction event in production namespaces
#    (kube_pod_status_reason is a 0/1 gauge, so use max_over_time rather than increase)
max_over_time(kube_pod_status_reason{reason="Evicted"}[10m]) > 0
Layered with the QoS rules, these three alerts catch nearly every OOM-related incident before it cascades.
The mental model#
Two systems pick which pod to kill on memory pressure: the kernel (uses oom_score_adj derived from QoS) and the kubelet (uses request-overshoot ranking and PriorityClass). They have similar but not identical preferences. Both penalize BestEffort pods. Both reward Guaranteed pods. Both are roughly OK at finding actual misbehavers but not perfectly.
The fix is upstream: make sure your QoS classes match your importance hierarchy. Critical pods are Guaranteed with high PriorityClass. Normal pods are Burstable with realistic requests. BestEffort is for explicit "this can die at any time" workloads (cron jobs that re-run, debug tools), not for "we forgot to set requests."
The OOM kill that hurts in production is usually a configuration mistake from months ago that nobody connected to today's incident. Audit your QoS classes once and most of the surprise goes away.
The kernel-side mechanics (cgroups, oom_score, /proc/PID/oom_score_adj, memory accounting) are covered in detail in the Linux Fundamentals course. The Kubernetes-side debugging patterns (eviction reasons, node pressure, kubectl describe forensics) are part of the Kubernetes Debugging course.