
How to Debug Kubernetes OOMKilled (Exit Code 137): The Complete Guide

Three completely different problems hide behind exit code 137. Most engineers fix the wrong one and the pod keeps crashing.

By Sharon Sahadevan · 11 min read

Your pod crashed. You run kubectl describe and see this:

State:       Running
Last State:  Terminated
Reason:      OOMKilled
Exit Code:   137

The team Slack lights up. Someone says "we need a bigger memory limit." Someone else says "the model is too large." Both might be wrong.

Here is the trap most engineers fall into. Three completely different failures surface as an out-of-memory crash: two of them report OOMKilled with exit code 137, and a third, GPU OOM, routinely gets misdiagnosed as the same problem. They have different root causes and different fixes. If you treat them as the same problem, you will fix the wrong thing and the pod will crash again.

This guide covers all three scenarios you will hit in production:

  • Scenario A: Container exceeded its own cgroup memory limit
  • Scenario B: Node ran out of memory and kubelet evicted your pod
  • Scenario C: GPU VRAM exhaustion (CUDA out of memory)

Each one looks similar at first glance. Each one needs a different fix.

Scenario A: Container exceeded its own cgroup memory limit

This is the most common case. The Linux kernel killed your container because the cgroup memory limit was exceeded. Your process tried to allocate more RAM than the pod was allowed.

The signal is clear. Check the exit code and the events. Run kubectl describe pod my-pod and look at the events section. You will see something like "Container my-app was OOMKilled" along with restart backoff messages.

The key phrase is "Container was OOMKilled." That tells you the kernel killed your specific container because it exceeded its cgroup memory limit. Other containers on the node are fine. The node has plenty of free memory. The kernel sent SIGKILL specifically to your container because cgroup accounting tracked it crossing its limit.

This is not a node issue. It is your container's allocation pattern.
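
If the pod has already restarted and the events have scrolled past, the last termination state is still recorded on the pod object. A quick way to confirm it (pod name is a placeholder):

kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

For Scenario A, these print OOMKilled and 137.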

Why this happens during model loading

Most production OOMs happen during model loading, and there is a specific reason. When you load a 14 GB model, the weights briefly exist in both host RAM and GPU VRAM at the same time. The model file is read into RAM, then transferred to the GPU. For a few seconds, you need 14 GB of RAM and 14 GB of VRAM simultaneously.

If your pod has limits.memory set to 16 GB, you are right at the edge. Add Python interpreter overhead, library imports, request buffers, and you are over the limit. SIGKILL fires. Pod crashes. Restart. The next attempt to pull the model into VRAM hits the same wall. CrashLoopBackOff.
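
If you want to watch the spike live instead of inferring it after the kill, you can poll the container's own cgroup counter while the model loads. A rough sketch, assuming cgroup v2 and an image that ships a shell:

kubectl exec my-pod -- sh -c 'while sleep 1; do cat /sys/fs/cgroup/memory.current; done'

On cgroup v1 nodes the equivalent file is /sys/fs/cgroup/memory/memory.usage_in_bytes. Either way, you will see the number climb toward the limit during loading and, if the process survives, drop back once the host-side copy of the weights is released.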

The fix for Scenario A

Set your memory limit using this formula:

limits.memory = (model size in GB × 1.5) + 4 GB

For a 14 GB model, set the limit to roughly 25 GB. The 1.5× factor accounts for the dual residency during loading. The 4 GB buffer covers Python, libraries, and request handling.

Increasing the GPU memory does nothing here. Buying a bigger GPU does nothing here. Raising memory requests does nothing here. The fix is in the Kubernetes resource spec, specifically limits.memory.
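
To apply the 25 Gi figure from the formula above, one option is kubectl set resources, assuming a Deployment and container both named my-app (placeholders):

kubectl set resources deployment/my-app -c my-app --limits=memory=25Gi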

What about the requests vs limits relationship?

When you raise limits, also consider raising requests. Here is why this matters.

Setting requests equal to limits gives your pod the Guaranteed QoS class (to qualify, every container in the pod needs requests equal to limits for both CPU and memory). This protects you from being evicted during node memory pressure (Scenario B, below). If requests are set significantly below limits, you get Burstable QoS, and your pod becomes more likely to be evicted before it ever hits its own limit.

For most production workloads with consistent memory usage, set requests equal to limits. The exception is workloads with predictable bursts (JVM apps with full GC cycles, batch jobs with brief peaks), where requests can be lower than limits to allow node-level oversubscription.
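
Once requests and limits match, verify the QoS class the API server actually assigned, because a single container with mismatched CPU values is enough to drop the pod back to Burstable (pod name is a placeholder):

kubectl get pod my-pod -o jsonpath='{.status.qosClass}'

Expect Guaranteed. If it prints Burstable, some container in the pod still has requests below limits.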

The deeper mechanics of cgroup accounting (what counts toward memory.current, why your dashboard graph disagrees with the OOM killer, cgroup v1 vs v2 differences) are covered in cgroups, Pod Memory Limits, and What Actually Gets Counted.

Scenario B: Node ran out of memory (kubelet eviction)

This scenario produces the same exit code 137, but the root cause is completely different. Your container did not exceed its own limit. The node ran out of memory and the kubelet evicted your pod to reclaim resources.

This is the scenario most engineers miss because they assume exit code 137 always means cgroup OOM.

How to tell you are in Scenario B

Run kubectl describe pod my-pod and look at the events. You will see different messages this time. Look for:

  • "The node was low on resource: memory"
  • "Evicted"
  • "Container my-app was using 4Gi, which exceeds its request of 1Gi"

The key phrases are "The node was low on resource: memory" and "Evicted." That is kubelet eviction, not cgroup OOM. Your pod was using 4 GB. Its limit was 8 GB. It never hit its own limit. But the node ran out of memory and the kubelet picked your pod as the sacrifice.

You can also check the node's recent events:

kubectl describe node $NODE | grep -A5 -i memorypressure
kubectl get events --field-selector reason=Evicted -A --sort-by=.lastTimestamp | tail -20

MemoryPressure: True in the node conditions means the kubelet is actively evicting. The cluster-wide Evicted event list catches the cases where the pod has already been garbage-collected.

Why this happens

Kubernetes is designed to allow overcommitment. The sum of all limits on a node can exceed the node's actual capacity, because not every pod uses its full limit at the same time. This works until it does not. When multiple pods spike at once, the node runs out of memory.
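
You can see how overcommitted a node is from its allocated-resources summary, which kubectl describe reports as a percentage of allocatable capacity:

kubectl describe node $NODE | grep -A8 'Allocated resources'

Memory limits well above 100 percent are normal on their own. They only become a problem when actual usage across pods approaches the node's capacity at the same time.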

When a node hits the memory.available threshold (default 100 Mi), the kubelet starts evicting pods to recover. Eviction priority order:

  1. BestEffort pods first (no requests or limits set), ranked by memory usage
  2. Burstable pods next (requests less than limits), sorted by usage above their request
  3. Guaranteed pods last (requests equal to limits), only as a last resort

If your pod is Burstable and using 4 GB above its 1 GB request, it gets evicted before a Burstable pod using 4 GB on a 4 GB request. The pod that is "closer to its request" is treated as more legitimate. Raising your requests can move you down the eviction queue without changing your actual memory usage by a byte.
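
A quick way to see where your pods sit in that queue is to list their QoS classes (namespace is a placeholder):

kubectl get pods -n my-namespace -o custom-columns='NAME:.metadata.name,QOS:.status.qosClass'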

The fix for Scenario B

The fix here is different from Scenario A. You have three options.

Option 1: Raise requests to match actual usage. Set memory requests to 4 Gi and limits to 8 Gi. A higher request means the scheduler only places your pod on a node with that much capacity actually available. It also moves your pod further down the eviction queue during node pressure.

Option 2: Set requests equal to limits. Set both memory requests and limits to 8 Gi. This gives your pod Guaranteed QoS class. It will be evicted last, after all BestEffort and Burstable pods on the node.

Option 3: Fix the underlying cluster overcommitment. If you are frequently hitting Scenario B, the deeper issue is cluster capacity. Either your nodes are too small, your workloads are over-provisioned, or your cluster autoscaler is not keeping up with demand. This is a capacity planning conversation, not a per-pod fix.
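
As a sketch of options 1 and 2, again assuming a Deployment and container both named my-app (placeholders):

# Option 1: requests track real usage, limits leave burst headroom (Burstable)
kubectl set resources deployment/my-app -c my-app --requests=memory=4Gi --limits=memory=8Gi

# Option 2: requests equal limits (Guaranteed, evicted last)
kubectl set resources deployment/my-app -c my-app --requests=memory=8Gi --limits=memory=8Gi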

Don't apply Scenario A fixes to Scenario B problems

This is the most important takeaway. If your pod is being evicted due to node memory pressure, raising limits.memory does nothing. The pod will still be evicted because the node still does not have enough memory.

I have watched teams burn hours raising limits on pods being evicted from overcommitted nodes. The pods keep dying. Of course they do. The fix was never on the pod. It was on the node.

Always check events before deciding the fix.

Scenario C: GPU VRAM exhaustion (CUDA out of memory)

This one looks completely different from Scenarios A and B. The exit code is usually 1, not 137. Kubernetes did not kill the process. The CUDA runtime did.

Look in the pod logs. You will see something like:

torch.cuda.OutOfMemoryError: CUDA out of memory.
Tried to allocate 2.00 GiB. GPU 0 has a total capacity of
39.50 GiB of which 1.23 GiB is free.

This is the GPU running out of VRAM. The Linux kernel has nothing to do with it. The container's memory limit has nothing to do with it. The host could have 200 GB of free RAM and this would still happen.

CUDA OOM is almost always caused by KV cache exhaustion under load.

Your model fits at idle. With 8 GB of VRAM headroom you feel comfortable. Then concurrent requests pile up. The KV cache grows linearly with active sequences. By request number 200, the KV cache has consumed all remaining VRAM. The next request fails.
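
To see how close you actually are under load, check VRAM from inside the pod. This assumes nvidia-smi is present in the image, which it is for most CUDA base images (pod name is a placeholder):

kubectl exec my-pod -- nvidia-smi --query-gpu=memory.used,memory.total --format=csv

Run it once at idle and again at peak traffic. The gap between the two readings is what the KV cache is consuming.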

The fix for Scenario C

The fixes are completely different from Scenarios A and B. Three knobs in your vLLM configuration matter:

  • gpu-memory-utilization controls how much VRAM vLLM is allowed to use. Lower it from 0.95 to 0.85 and you reserve 10 percent for safety margin. This is the most common fix.
  • max-num-seqs caps concurrent sequences. If you are getting CUDA OOM at high concurrency, lower this number. Throughput drops slightly. Crashes stop entirely.
  • max-model-len limits the maximum sequence length. If your KV cache is exhausting because of long sequences, this is your knob.

Increasing the pod's RAM limit does nothing here. The cluster autoscaler does nothing here. The fix is in the inference engine configuration.
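
For reference, this is roughly what those three knobs look like on the vLLM command line. The model name and the values are placeholders to tune, not recommendations:

vllm serve my-org/my-model \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 128 \
  --max-model-len 8192

If you run vLLM as a container, the same flags go into the container's args in the pod spec.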

The full tuning playbook for picking the right gpu-memory-utilization value, including the reproducible loop I run for every new (model, GPU, max_model_len) tuple, is in Tuning vLLM gpu_memory_utilization Without Breaking Production.

The diagnostic flowchart

When a pod crashes, do not guess. Run this systematic check.

Step 1. Run kubectl describe pod and look at the events section.

Step 2. Read the message and identify the scenario.

  • If you see "Container was OOMKilled" you are in Scenario A. The cgroup limit was exceeded. Fix by raising limits.memory.
  • If you see "The node was low on resource: memory" or pod status Evicted you are in Scenario B. The kubelet evicted your pod. Fix by raising requests or fixing cluster capacity.
  • If there is no Kubernetes OOM message but logs show "CUDA out of memory" you are in Scenario C. GPU VRAM is exhausted. Fix by tuning inference engine configuration.
  • If there is no OOM message and the container exited with code 1, this is an application bug, not OOM. Read application logs from the start.

Step 3. Apply the right fix. Wrong fix means the pod will crash again.
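
The whole check fits in two commands (pod name is a placeholder):

kubectl describe pod my-pod | grep -A15 Events    # what does Kubernetes say happened?
kubectl logs my-pod --previous | tail -50         # what did the app log before it died?

The --previous flag matters: the logs you want come from the crashed container, not the freshly restarted one.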

Quick reference

Symptom | Scenario | Root cause | Fix
Exit 137 with "Container was OOMKilled" | A | Container exceeded cgroup limit | Raise limits.memory
Exit 137 with "Node was low on resource: memory" or Evicted | B | Node ran out of memory, kubelet evicted | Raise requests, or fix cluster capacity
Exit 1 with "CUDA out of memory" in logs | C | GPU VRAM exhausted | Tune gpu-memory-utilization, max-num-seqs, max-model-len

Why this matters

I have seen teams burn three days debugging the wrong type of OOM. Engineers add more RAM to pods that are dying from CUDA OOM. They tune vLLM for pods that are dying from host memory. They raise limits on pods being evicted from overcommitted nodes. The pod keeps crashing. The team starts blaming the model, the GPU, the cloud provider.

The fix takes 30 seconds once you know which scenario you are dealing with. Check the events. Check the logs. Apply the right fix.

The deeper problem

Most Kubernetes content treats OOM as one problem. Tutorials say "increase your memory limit" without explaining that this only fixes one type of OOM. Documentation lists CUDA errors without explaining when they apply. Engineers learn the layered nature of Kubernetes memory management through painful production incidents.

Production GPU and Kubernetes workloads have failure modes that a generic course will never teach you. The dual-residency pattern during model loading. KV cache exhaustion under concurrent load. The QoS class trap that affects eviction order. The cluster overcommitment problem that surfaces as random pod evictions. These are not edge cases. They are the daily reality of running production infrastructure at scale.

This is exactly why I built DevOpsBeast.

If you found this useful, the related courses go much deeper. All three are part of the production engineering catalog at devopsbeast.com.

If you have questions about edge cases not covered here, or if you have hit a scenario that does not fit cleanly into A, B, or C, send me a message on LinkedIn. The best updates to these articles come from engineers running real production workloads.