Kubernetes Performance Optimization

Resource Requests and Limits: The Foundation of Everything

Half your pods have no resource requests. The other half have requests set to 4 CPU and 8Gi memory "just to be safe." Your nodes are 70% allocated but only 15% utilized. Fix it.

If I could only teach engineers one Kubernetes performance topic, this would be it. Requests and limits are the load-bearing primitive that every later optimization rests on. Get them wrong and the cluster is simultaneously over-provisioned (high cost, low utilization) and unstable (OOMKills, throttling, eviction). Get them right and most of the other performance work in this course gets dramatically easier.

This lesson is what requests and limits actually do at the kernel level, why the wrong values cause specific failure modes, and how to reason about them.

The problem

The most common pattern I see in production:

30% of pods have no resource requests at all (BestEffort QoS)
50% have requests set to oversized round numbers ("4 CPU, 8Gi memory") because someone copied a template
15% have requests roughly matching reality
5% are intentionally over-requested for QoS reasons

The result is the situation described at the top of this lesson: nodes are 70% allocated but only 15% utilized. The scheduler thinks the cluster is full because of the requests on paper, while the actual hardware sits mostly idle.

The cost of this is not theoretical. At a typical $0.05/hour for a vCPU on a hyperscaler, a 200-node cluster running at 15% utilization is wasting roughly $10K-30K per month on capacity that exists for no one. Right-sizing requests is one of the highest-ROI changes you can make to a cluster, and lessons 1.4 and 1.5 give you the tools to do it at scale.
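To make the arithmetic behind that estimate concrete, here is a minimal sketch. The $0.05 per vCPU-hour price, the 200 nodes and the 15% utilization come from the paragraph above; the node size (4 vCPUs) and the 730 hours per month are my assumptions, and the function name is purely illustrative.

```python
HOURS_PER_MONTH = 730  # assumed average number of hours in a month


def monthly_idle_cost(nodes, vcpus_per_node, price_per_vcpu_hour, utilization):
    """Monthly spend on vCPU capacity that is paid for but not used."""
    total = nodes * vcpus_per_node * price_per_vcpu_hour * HOURS_PER_MONTH
    return total * (1.0 - utilization)


# 200 nodes at $0.05 per vCPU-hour and 15% utilization.
# vcpus_per_node=4 is an assumed node size, not a figure from the lesson.
cost = monthly_idle_cost(nodes=200, vcpus_per_node=4,
                         price_per_vcpu_hour=0.05, utilization=0.15)
print(f"${cost:,.0f} per month")
```

With these assumed inputs the idle spend comes out at about $24.8K per month, inside the range quoted above. Larger nodes push the figure well past it, so treat the range as an order-of-magnitude guide.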

But before you can right-size, you need to understand what requests and limits actually do. Confusion at this layer cascades into every later optimization decision.

How it works under the hood

requests and limits are two separate things that operate at two different layers. Treating them as the same field with different default values is the root cause of most production confusion.

What requests and limits actually control

limits.memory: kernel OOM enforcement

Written to memory.max in cgroup v2. Container processes that try to allocate above this get OOMKilled by the kernel. Hard wall, immediate, fatal.

limits.cpu: CFS bandwidth control

Written to cpu.max in cgroup v2 as quota per period. Container processes get throttled when they try to use more than their quota in a 100ms window. Soft wall, recoverable, recurring.

requests.memory: scheduler fit + QoS ranking

Used by the scheduler to fit the pod onto a node. Used by the kubelet to rank pods for eviction under node memory pressure. Not enforced as a runtime limit.

requests.cpu: scheduler fit + cgroup share

Used by the scheduler to fit the pod onto a node. Written to cpu.weight in cgroup v2 as the relative share of CPU under contention. A soft signal to the kernel's CPU scheduler, not a cap.
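The mapping from pod spec to cgroup v2 files can be sketched as below. This is an illustration, not kubelet code: the function names are hypothetical, inputs are assumed to be already normalized to millicores and bytes, and the cpu.weight conversion is the linear shares-to-weight mapping described in the Kubernetes cgroup v2 enhancement proposal. Newer container runtimes may use a different conversion, so check real values on your own nodes.

```python
CFS_PERIOD_USEC = 100_000  # the 100ms CFS period, in microseconds


def cpu_max(limit_millicores=None, period_usec=CFS_PERIOD_USEC):
    """cpu.max content ("<quota> <period>") for a limits.cpu in millicores."""
    if limit_millicores is None:
        return f"max {period_usec}"  # no limits.cpu: unlimited quota
    quota_usec = limit_millicores * period_usec // 1000
    return f"{quota_usec} {period_usec}"


def cpu_weight(request_millicores):
    """cpu.weight for a requests.cpu, going through the cgroup v1 shares value."""
    # cgroup v1 cpu.shares, clamped to its valid range [2, 262144]
    shares = min(262144, max(2, request_millicores * 1024 // 1000))
    # Linear mapping of [2, 262144] onto the cgroup v2 range [1, 10000]
    return 1 + ((shares - 2) * 9999) // 262142


def memory_max(limit_bytes=None):
    """memory.max content for a limits.memory in bytes."""
    return "max" if limit_bytes is None else str(limit_bytes)


print(cpu_max(500))               # limits.cpu: 500m
print(cpu_weight(500))            # requests.cpu: 500m
print(memory_max(2 * 1024 ** 3))  # limits.memory: 2Gi
```

Note what is missing: requests.memory has no cgroup file in this sketch, which matches the point above that it is not enforced as a runtime limit.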


Two key consequences of this breakdown:

requests are mostly about scheduling. limits are about kernel enforcement. The scheduler uses requests to decide where the pod fits, and requests feed into the pod's QoS class. The kernel uses limits to decide what to throttle or kill at runtime. They are two different conversations, with two different consequences for getting them wrong.

The QoS class is derived from the relationship between requests and limits, and it controls eviction priority under node pressure:

QoS class	Requirement	Eviction priority
Guaranteed	requests == limits for both CPU and memory	Last to evict
Burstable	At least one request set, requests != limits	Middle, ranked by usage above request
BestEffort	No requests or limits set	First to evict

A pod with requests: {memory: 1Gi} and limits: {memory: 2Gi} is Burstable. The same pod with requests: {memory: 2Gi, cpu: 500m}, limits: {memory: 2Gi, cpu: 500m} is Guaranteed. The behavior under node memory pressure is dramatically different.
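The rules in the table can be sketched as a small classifier. This is a simplified model under stated assumptions: the function name and input shape are hypothetical, values are assumed to be already normalized to numbers (millicores, bytes), only cpu and memory are considered, and an unset request is treated as defaulting to the limit when a limit exists.

```python
GI = 1024 ** 3


def qos_class(containers):
    """Classify a pod from its containers' CPU/memory requests and limits.

    Each container is a dict like {"requests": {...}, "limits": {...}}.
    """
    any_set = False
    guaranteed = True
    for c in containers:
        requests = c.get("requests", {})
        limits = c.get("limits", {})
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # An unset request defaults to the limit when a limit is present.
            request = requests.get(resource, limit)
            if request is not None or limit is not None:
                any_set = True
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# The two pods from the paragraph above, plus one with nothing set.
print(qos_class([{"requests": {"memory": 1 * GI}, "limits": {"memory": 2 * GI}}]))
print(qos_class([{"requests": {"memory": 2 * GI, "cpu": 500},
                  "limits": {"memory": 2 * GI, "cpu": 500}}]))
print(qos_class([{}]))
```

Note that Guaranteed requires every container in the pod to qualify. One sidecar without limits makes the whole pod Burstable.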

The CPU and memory enforcement mechanisms are also fundamentally different. Hitting your memory limit is fatal: the kernel OOMKiller fires, your process is gone, the container restarts. Hitting your CPU limit is recoverable but recurring: CFS bandwidth control throttles the process for the rest of the 100ms period, then the next period starts with a fresh quota. We cover CPU throttling end-to-end in lesson 1.5.
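A rough model of the throttling side makes the latency cost visible. The sketch below is a deliberate simplification, not how the kernel accounts time: it assumes the work is perfectly parallel across the given threads, the burst starts at a period boundary, and nothing else competes for CPU. The function and parameter names are mine.

```python
import math


def wall_time_ms(cpu_work_ms, quota_ms, period_ms=100, threads=1):
    """Wall-clock time to finish cpu_work_ms of CPU time under a CFS quota.

    quota_ms is the CPU time allowed per period (limits.cpu: 1 -> 100ms).
    """
    if cpu_work_ms <= 0:
        return 0.0
    if quota_ms >= threads * period_ms:
        return cpu_work_ms / threads  # the quota is never exhausted
    # Periods in which the full quota is burned and the rest is spent throttled
    full_periods = math.ceil(cpu_work_ms / quota_ms) - 1
    remaining = cpu_work_ms - full_periods * quota_ms
    return full_periods * period_ms + remaining / threads


# 400ms of CPU work spread over 4 threads, with and without a 1-CPU limit
print(wall_time_ms(400, quota_ms=100, threads=4))
print(wall_time_ms(400, quota_ms=400, threads=4))
```

In this model, four busy threads under limits.cpu: 1 burn the 100ms quota in 25ms of wall time and then sit throttled for the remaining 75ms of each period, so work that would take 100ms unthrottled takes 325ms.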

KEY CONCEPT

Memory and CPU are not symmetric. Over-limit memory means OOMKill (terminal). Over-limit CPU means throttling (latency spike). The right limits.memory is "real peak usage plus safety margin." The right limits.cpu is often "no limit at all" for non-multi-tenant workloads, because throttling causes more pain than the limit prevents. Confusing these is the source of half the production incidents around resource limits.

Diagnosis and measurement

Three diagnostic queries to run on any cluster you are evaluating:

1. Allocation vs utilization gap. This is the "wasting money" check.

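# CPU requested vs CPU actually used, by namespace
sum by (namespace) (
  kube_pod_container_resource_requests{resource="cpu"}
)
/
sum by (namespace) (
  # container!="" drops the pod-level cgroup series so usage is not double-counted
  rate(container_cpu_usage_seconds_total{container!=""}[5m])
)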

A ratio above 3 means namespace requests are 3x actual usage. That is real waste. Above 5 is egregious.

2. QoS class distribution.

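# Assumes a kube-state-metrics version that exposes kube_pod_status_qos_class (1 for the pod's class, 0 otherwise)
sum by (qos_class) (kube_pod_status_qos_class * on (namespace, pod) group_left() kube_pod_status_phase{phase="Running"})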

If more than 20% of running pods are BestEffort, you have a stability risk: those pods are first to be evicted when the node gets memory pressure. If less than 30% are Guaranteed, you cannot promise stable behavior to latency-sensitive workloads.

3. OOMKill rate.

sum by (namespace) (
  rate(container_oom_events_total[5m])
)

Anything above zero is a signal that limits are being hit. Investigate which pods, what their actual memory profile looks like, whether the limit needs to go up or the leak needs to be fixed.

For a single suspect pod, get the cgroup data directly:

# Find the container ID
crictl ps --name my-container

# Get the cgroup path
crictl inspect <container-id> | jq -r '.info.runtimeSpec.linux.cgroupsPath'

# On a cgroup v2 node
cat /sys/fs/cgroup/<path>/memory.current  # current memory usage in bytes
cat /sys/fs/cgroup/<path>/memory.max      # the limit
cat /sys/fs/cgroup/<path>/memory.events   # counters for high/max/oom events
cat /sys/fs/cgroup/<path>/cpu.stat        # nr_throttled, throttled_usec

This is the ground truth that no Prometheus query can match. The deeper cgroup mechanics are covered in "cgroups, Pod Memory Limits, and What Actually Gets Counted."
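If you want to turn those files into numbers, cpu.stat and memory.events are simple "key value" text files and are easy to parse. The following is a minimal sketch: the helper names are hypothetical and the sample content is made up for illustration, not captured from a real node.

```python
def parse_flat_keyed(text):
    """Parse a cgroup v2 flat-keyed file (one "key value" pair per line)."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_ratio(cpu_stat):
    """Fraction of CFS periods in which the cgroup was throttled."""
    periods = cpu_stat.get("nr_periods", 0)
    return cpu_stat.get("nr_throttled", 0) / periods if periods else 0.0


# Illustrative values only; on a node, read /sys/fs/cgroup/<path>/cpu.stat instead.
SAMPLE_CPU_STAT = """\
usage_usec 8123456
user_usec 6000000
system_usec 2123456
nr_periods 1200
nr_throttled 300
throttled_usec 18000000
"""

stat = parse_flat_keyed(SAMPLE_CPU_STAT)
print(f"throttled in {throttled_ratio(stat):.0%} of periods")
```

The counters are cumulative since the cgroup was created, so for a current rate, sample twice and take the difference.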

The fix

The order in which you address resource problems matters. Cheap, reversible changes first:

1. Set requests.memory on every pod. A pod with no requests or limits at all is BestEffort QoS, which means it is among the first to be evicted when memory pressure hits. A reasonable starting point: take p95 actual memory usage from the last 7 days, add 25%, set as request.

2. Set limits.memory to a real number. OOMKills under your own limit are noisy but bounded. OOMKills caused by a node running out of memory take down whatever pods the kubelet picks. The first is preferable.

3. Set requests.cpu based on real usage, not vibes. Same approach as memory: p95 of actual usage plus headroom. The default value of "1 CPU" or "500m" out of a template is almost always wrong.

4. Carefully consider whether to set limits.cpu at all. This is the controversial one. CPU throttling causes latency spikes that look like the application is broken when it is just being prevented from using available CPU. For most workloads on dedicated nodes, omitting limits.cpu (so the pod can burst into spare capacity) produces better behavior than setting a tight limit. The cases where you do want a CPU limit:

Multi-tenant clusters where one pod cannot be allowed to monopolize CPU
Compliance requirements that mandate a hard cap
Workloads with explicit billing models tied to CPU consumption

For everything else, no limits.cpu plus a sensible requests.cpu (so the scheduler still bin-packs correctly) is often the right choice. Lesson 1.5 makes this case in detail.
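The "p95 plus headroom" rule from steps 1 and 3 can be sketched in a few lines. This is one simple way to do it, assuming you have already exported usage samples (for example, 7 days of memory readings in MiB). The function names are mine, and the nearest-rank percentile is a choice, not the only valid definition.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def recommend_request(samples, pct=95, headroom=0.25):
    """Suggested request: the chosen percentile of usage plus headroom, rounded up."""
    return math.ceil(percentile(samples, pct) * (1 + headroom))


# Hypothetical memory samples in MiB; real input would come from your metrics store.
usage_mib = [410, 420, 455, 470, 480, 500, 505, 530, 610, 640]
print(f"requests.memory: {recommend_request(usage_mib)}Mi")
```

The same function works for CPU if the samples are in millicores. Treat the output as a starting point to review, not a value to apply blindly.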

A concrete example of a well-tuned pod spec for a moderate-load web service:

resources:
  requests:
    cpu: 500m       # actual p95 CPU during normal load
    memory: 768Mi   # actual p95 memory during normal load
  limits:
    memory: 1.5Gi   # 2x request, room for legitimate spikes
    # No cpu limit: let it burst into spare capacity

This pod is Burstable QoS, fits cleanly into the scheduler, and will not be OOMKilled or throttled under normal operation. If the node runs short on memory, it ranks better for survival than a similar pod with a 100Mi request and a 4Gi limit, because that pod's usage will sit far above its request, and usage above request is what the kubelet ranks on.
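The eviction comparison above can be sketched as a sort. This is a simplified model of node-pressure eviction as I understand it from the Kubernetes documentation: pods whose usage exceeds their request are considered first, then lower pod priority, then the largest usage above request. The class, field names and sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PodMemory:
    name: str
    request_mib: int
    usage_mib: int
    priority: int = 0


def eviction_order(pods):
    """Return pods in the order they would be considered for eviction."""
    def key(p):
        exceeds = p.usage_mib > p.request_mib
        return (
            0 if exceeds else 1,              # pods over their request go first
            p.priority,                       # then lower priority first
            -(p.usage_mib - p.request_mib),   # then largest usage above request
        )
    return sorted(pods, key=key)


pods = [
    PodMemory("right-sized", request_mib=768, usage_mib=700),
    PodMemory("tiny-request", request_mib=100, usage_mib=900),
]
print([p.name for p in eviction_order(pods)])
```

The right-sized pod stays under its request and is considered last, while the pod with the token request is 800Mi over and goes to the front of the queue.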

WAR STORY

A team I worked with had a high-traffic API service with limits.cpu: 1 and requests.cpu: 100m. Average CPU was 200m. P99 latency was 4 seconds. The cause: brief CPU bursts (during garbage collection, for example) hit the 1-CPU CFS quota and got throttled for 50-80ms at a time. We removed limits.cpu entirely and kept a requests.cpu, at 200m, for scheduling purposes. P99 latency dropped to 280ms. Average CPU went up slightly because the application was allowed to actually use the spare capacity it needed. Lesson: CPU limits cause more visible production pain than they prevent for most workloads. Set them deliberately, not by default.

Before and after

A right-sizing pass on a 200-pod namespace, typical results:

Metric	Before (default-template requests)	After (right-sized)
Sum of CPU requests	250 cores	85 cores
Sum of memory requests	480 GiB	220 GiB
Cluster CPU allocation	70%	35%
Cluster CPU actual utilization	15%	30%
Pod restarts due to OOMKill	4-10/day	0-1/day
BestEffort pod count	60	0
Guaranteed pod count (latency-sensitive)	5	22

The headline number is allocation dropping from 70% to 35% on the same actual workload. That is half of the previously allocated capacity freed up for other work or for scaling down.

Common mistakes

No requests set. BestEffort QoS, first to evict. Almost never the right answer for production workloads.
Requests set far above actual usage. Wastes capacity, looks busy when it is not. Audit the allocation vs usage ratio quarterly.
limits.cpu set on every pod by default. Causes throttling-induced latency spikes. Set deliberately for multi-tenant or billing reasons, omit otherwise.
limits.memory set tight to "save resources." When the legitimate spike comes, the pod OOMKills. Memory limits should cover real peak plus margin.
Equating requests with actual usage. Requests are a scheduling primitive, not a measurement. Actual usage is a separate signal from container_cpu_usage_seconds_total and container_memory_working_set_bytes.
Treating memory and CPU as symmetric. They have different enforcement (kill vs throttle), different consequences, and different right answers.
Ignoring QoS class. Production-critical workloads should be Guaranteed; batch and best-effort can be Burstable; BestEffort should be rare.

INTERVIEW QUESTION

Explain exactly what happens at the Linux kernel level when a container hits its CPU limit vs its memory limit. How does this affect application performance?
