Kubernetes Performance Optimization

CPU Throttling: The Silent Performance Killer

Your application's p99 latency spikes every 100ms. CPU utilization is only 30%. The app team says it's a Kubernetes problem. It is, but not the kind they think.

This is the lesson that fixes more p99 latency complaints than any other in the course. CPU throttling is one of those failure modes where every surface metric looks healthy. CPU utilization is moderate, the pod is not being OOMKilled, the application is not crashing, and yet user-facing latency is 10x worse than it should be. The cause hides in a counter most teams have never looked at.

By the end of this lesson, you will know exactly what is happening at the kernel level, how to detect it definitively, and how to decide whether the right fix is to raise the limit, remove the limit, or do something else entirely.

The problem

The signature of CPU throttling, in three sentences:

Average CPU utilization on the pod is moderate (often 20-40%).
P99 latency on the workload is 5-20x higher than P50.
Throttling spikes correlate exactly with latency spikes.

Engineers usually arrive at this lesson through one of two paths. Either the app team complains about latency and the platform team has been told "the app is fine, must be Kubernetes," or someone notices container_cpu_cfs_throttled_periods_total in a dashboard for the first time and realizes it is not zero.

The mental model that confuses people: "I gave the pod a 1 CPU limit and the pod is only using 0.3 CPU on average. How can it be CPU-bound?" The answer is in the word "average." CPU usage averaged over a minute can be 30% while the pod is being throttled for 30ms out of every 100ms window. The average hides the spikes; the spikes are what users feel.
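The gap between the average and the per-period view can be sketched in a few lines of Python. The trace below is hypothetical, and the function name and the rule "a period is throttled when usage reached the quota" are simplifying assumptions for illustration, not how the kernel accounts for it.

```python
def summarize_periods(usage_ms, quota_ms=100.0):
    """Return (average utilization, fraction of periods throttled).

    usage_ms holds the CPU time consumed in each 100ms period. A period is
    counted as throttled when usage reached the quota (simplifying assumption).
    """
    avg_utilization = sum(usage_ms) / (len(usage_ms) * quota_ms)
    throttled = sum(1 for used in usage_ms if used >= quota_ms)
    return avg_utilization, throttled / len(usage_ms)


# Hypothetical one-second trace for a pod with a 1 CPU limit: three bursty
# periods hit the quota, the other seven are nearly idle.
trace = [100, 5, 5, 100, 5, 5, 5, 100, 5, 5]
avg, throttled_ratio = summarize_periods(trace)
print(f"average utilization {avg:.1%}, throttled periods {throttled_ratio:.0%}")
```

On this trace the average comes out around a third of the limit, while 3 of the 10 periods hit the quota. A utilization dashboard would show the first number and hide the second.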

KEY CONCEPT

"Average CPU is low" is not the same as "the pod is not CPU-bound." A pod can be heavily throttled while showing low average CPU because throttling happens in 100ms windows and average dilutes across longer periods. The right metric for CPU contention is throttled period rate, not utilization.

How it works under the hood

The Linux kernel uses CFS (Completely Fair Scheduler) bandwidth control to enforce CPU limits. The mechanism has two knobs and a clock:

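cpu.cfs_period_us: the size of the accounting window, in microseconds. Default 100,000 (100ms).
cpu.cfs_quota_us: how much CPU time the cgroup can use within one period, in microseconds. For a 1 CPU limit, quota = period × 1 = 100,000us. For 0.5 CPU, quota = 50,000us.

These are the cgroup v1 file names. On cgroup v2, both values live in a single cpu.max file, written as the quota followed by the period (for example "100000 100000" for a 1 CPU limit).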

When a process in the cgroup tries to run, the kernel debits its quota. When the quota hits zero, the process is throttled until the next period starts. The next period starts on a fixed wall-clock boundary, regardless of what the workload was doing.
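The debit-then-wait behaviour described above can be modelled with a short Python sketch. It is a deliberately simplified model, not the kernel's algorithm: it assumes the burst starts exactly on a period boundary with a full quota, that nothing else runs in the cgroup, and that the work parallelizes evenly across threads. The function name and parameters are made up for this example.

```python
def wall_clock_ms(cpu_needed_ms, limit_cpus, threads=1, period_ms=100.0):
    """Wall-clock time to finish a burst of CPU work under a CFS quota.

    Simplified model: the burst starts at a period boundary, runs on
    `threads` CPUs in parallel, and is alone in its cgroup.
    """
    if limit_cpus <= 0 or threads < 1:
        raise ValueError("limit_cpus must be positive and threads at least 1")
    quota_ms = limit_cpus * period_ms
    # CPU time usable per period: capped by the quota and by the thread count.
    per_period_ms = min(quota_ms, threads * period_ms)
    elapsed = 0.0
    remaining = cpu_needed_ms
    while remaining > 0:
        budget = min(remaining, per_period_ms)
        remaining -= budget
        if remaining > 0:
            # Quota exhausted with work left: wait for the next period.
            elapsed += period_ms
        else:
            # Final slice: only the time actually spent running counts.
            elapsed += budget / threads
    return elapsed


# 70ms of single-threaded work under a 0.5 CPU limit (50ms quota per period).
print(wall_clock_ms(70, limit_cpus=0.5))
```

Under these assumptions, 70ms of work takes 120ms of wall-clock time: 50ms running, 50ms throttled, then the last 20ms in the next period. With no quota in the way, the same work would finish in 70ms.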

A throttling event, 100ms window by 100ms window


The thing that makes this counterintuitive: the pod averaged about 70% CPU over the 100ms window (70ms of actual work in 100ms wall time), which sounds like efficient utilization. The user, however, saw a 53ms pause in service. If your latency SLO is 200ms p99, this single throttling event blew it.

The kernel exposes throttling through two counters in the cgroup's cpu.stat file (cgroup v2 path):

$ cat /sys/fs/cgroup/<pod-cgroup>/cpu.stat
usage_usec 142000000
user_usec 98000000
system_usec 44000000
nr_periods 1500
nr_throttled 89
throttled_usec 4200000

nr_throttled is the number of periods where the pod hit its quota. throttled_usec is total wall-clock time spent throttled. A pod with nr_throttled / nr_periods > 0.05 (more than 5% of periods throttled) is being held back by its CPU limit, full stop.
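As a rough illustration, the ratio can be computed from the cpu.stat output shown above with a few lines of Python. The helper names are made up for this sketch, and the sample text is the same one printed earlier; the 5% threshold is the one stated in this lesson.

```python
def parse_cpu_stat(text):
    """Parse the 'key value' lines of a cpu.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_summary(stats):
    """Return (throttled ratio, mean throttled milliseconds per throttled period)."""
    periods = stats.get("nr_periods", 0)
    throttled = stats.get("nr_throttled", 0)
    ratio = throttled / periods if periods else 0.0
    mean_ms = stats.get("throttled_usec", 0) / throttled / 1000 if throttled else 0.0
    return ratio, mean_ms


SAMPLE = """usage_usec 142000000
user_usec 98000000
system_usec 44000000
nr_periods 1500
nr_throttled 89
throttled_usec 4200000
"""

ratio, mean_ms = throttle_summary(parse_cpu_stat(SAMPLE))
verdict = "held back by its CPU limit" if ratio > 0.05 else "below the 5% threshold"
print(f"{ratio:.1%} of periods throttled, {mean_ms:.1f} ms per throttled period: {verdict}")
```

For the sample above this works out to roughly 5.9% of periods throttled, with about 47ms lost in each throttled period. These counters are cumulative since the cgroup was created, so the result describes the pod's whole lifetime rather than the current moment.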

The Prometheus metrics that surface this:

container_cpu_cfs_periods_total: counter of periods elapsed
container_cpu_cfs_throttled_periods_total: counter of periods where throttling occurred
container_cpu_cfs_throttled_seconds_total: total wall-clock time the cgroup was throttled

The diagnostic ratio is throttled_periods / periods, computed as a rate over time.

Diagnosis and measurement

The single Prometheus query that catches throttling everywhere:

# CPU throttling ratio by pod, top 20 worst offenders
topk(20,
  sum by (namespace, pod) (
    rate(container_cpu_cfs_throttled_periods_total[5m])
  )
  /
  sum by (namespace, pod) (
    rate(container_cpu_cfs_periods_total[5m])
  )
)

Any pod above 0.05 (5%) is being throttled enough to cause user-visible latency. Above 0.20 is severe.

Combine with utilization to see the "low CPU but throttling" signature:

# Pods with low utilization but high throttling
(
  sum by (namespace, pod) (rate(container_cpu_cfs_throttled_periods_total[5m]))
  /
  sum by (namespace, pod) (rate(container_cpu_cfs_periods_total[5m]))
) > 0.10
and on (namespace, pod)
(
  # container!="" skips the pod-level cgroup series so usage is not double-counted
  sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
  /
  sum by (namespace, pod) (kube_pod_container_resource_limits{resource="cpu"})
) < 0.5

Pods returned by this query are throttled more than 10% of the time despite using less than 50% of their CPU limit on average. They are textbook cases for the "raise or remove the limit" treatment below.

For one specific pod, get cgroup-level data directly:

# On the node, inside the pod's cgroup
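grep -E 'nr_periods|nr_throttled|throttled_usec' /sys/fs/cgroup/<path>/cpu.stat

The counters are cumulative, so a single read only shows the lifetime totals. Run the command twice, about 5 seconds apart: if nr_throttled increased between the two reads, throttling is happening right now.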
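A short sample like this can be scripted by diffing two snapshots of the counters. The sketch below is one way to do it in Python. The function names are hypothetical, and the cpu.stat path is left as a parameter because the pod's cgroup path differs per node and runtime.

```python
import time
from pathlib import Path

FIELDS = ("nr_periods", "nr_throttled", "throttled_usec")


def read_counters(cpu_stat_path):
    """Read the three throttling counters from a cpu.stat file."""
    stats = {}
    for line in Path(cpu_stat_path).read_text().splitlines():
        key, _, value = line.partition(" ")
        if key in FIELDS:
            stats[key] = int(value)
    return stats


def throttle_delta(before, after):
    """Difference between two counter snapshots, plus the throttled ratio."""
    delta = {key: after[key] - before[key] for key in FIELDS}
    periods = delta["nr_periods"]
    delta["ratio"] = delta["nr_throttled"] / periods if periods else 0.0
    return delta


def sample(cpu_stat_path, interval_s=5.0):
    """Take two snapshots interval_s apart and return what changed."""
    before = read_counters(cpu_stat_path)
    time.sleep(interval_s)
    return throttle_delta(before, read_counters(cpu_stat_path))
```

Calling sample() with the pod's cpu.stat path on the node returns the counters accumulated during the interval. A non-zero nr_throttled in the result means the pod was throttled during those 5 seconds.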

The fix

Three real options. They differ in tradeoffs, not in technical correctness.

Option 1: Raise the CPU limit. The most conservative fix. If the pod has limits.cpu: 500m and is throttled, bump to limits.cpu: 1000m or limits.cpu: 2000m. The pod will be throttled less; latency improves. The downside is the pod can now monopolize more CPU on the node if it ever truly bursts, which matters in multi-tenant environments.

Option 2: Remove the CPU limit entirely. Keep requests.cpu (so the scheduler still bin-packs correctly) but omit limits.cpu. The pod can burst into spare CPU on the node without throttling. This produces dramatically better latency for most workloads. The downside: if the node is at full CPU utilization, multiple unlimited pods compete for CPU and one pod can starve another. In single-tenant clusters or clusters with significant spare CPU headroom, this is usually the right call.

Option 3: Use static CPU manager policy with Guaranteed QoS. The pod gets exclusive CPUs that no other pod can use. Throttling becomes irrelevant because the kernel scheduler never has to share. Reserved for latency-critical workloads (trading systems, telco, real-time processing) and only worth the operational complexity for workloads where p99 latency is a business-critical metric. Covered fully in lesson 3.4.
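To make Option 2 concrete, here is a small Python sketch that takes a container's resources mapping (the same shape as the resources field in a pod spec), drops limits.cpu, and refuses to proceed if requests.cpu is missing. The function name is made up for this example, and the memory values are placeholders; only the CPU values come from this lesson.

```python
import copy


def without_cpu_limit(resources):
    """Return a copy of a container `resources` mapping with limits.cpu removed.

    requests.cpu must already be set, so the scheduler can still place the pod.
    """
    if "cpu" not in resources.get("requests", {}):
        raise ValueError("requests.cpu must be set before removing limits.cpu")
    updated = copy.deepcopy(resources)
    limits = updated.get("limits", {})
    limits.pop("cpu", None)
    if not limits:
        # Nothing left under limits: drop the empty mapping as well.
        updated.pop("limits", None)
    return updated


# Memory values are placeholders for illustration only.
resources = {
    "requests": {"cpu": "200m", "memory": "256Mi"},
    "limits": {"cpu": "500m", "memory": "256Mi"},
}
print(without_cpu_limit(resources))
```

The memory limit is left untouched on purpose. Only the CPU limit is enforced by throttling, so it is the only one this change removes.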

The right answer depends on the workload and the cluster:

Workload type	Recommended CPU limit policy
Web service on dedicated node pool, sub-200ms latency target	No `limits.cpu` (Option 2)
Multi-tenant cluster, many small workloads	`limits.cpu` set to 2-3x request (Option 1)
Latency-critical trading/telco workload	CPU Manager static policy (Option 3)
Batch job, latency-tolerant	`limits.cpu` set tight; throttling is fine
Cron job, occasional run	`limits.cpu` set; bursts are bounded

For most production-facing services on clusters with some spare CPU headroom, Option 2 (no CPU limit) is correct. The conventional wisdom of "always set CPU limits for safety" is wrong for latency-sensitive workloads, as the before-and-after numbers below show.

WAR STORY

A team I worked with had a payments API with requests.cpu: 200m, limits.cpu: 500m. Average CPU was 180m. P99 latency was 4 seconds against a 200ms SLO. Throttling rate was 35% of periods. We removed limits.cpu and kept the request the same. P99 dropped to 180ms within an hour. Average CPU went up slightly (the application could now actually use the spare capacity it needed). Throttling went to zero. The conversation with the security team was a single sentence: "we kept the request, so the scheduler still bin-packs; we removed the limit, so the kernel does not throttle." Lesson: CPU limits cause more production pain than they prevent for latency-sensitive workloads. Default to "no CPU limit" and add limits deliberately for specific reasons.

Before and after

A typical "removed CPU limits on the latency path" outcome:

Metric	Before (limits.cpu set)	After (limits.cpu removed)
P50 latency	35 ms	32 ms
P95 latency	280 ms	65 ms
P99 latency	4,200 ms	180 ms
Throttling rate	35% of periods	0%
Average CPU utilization	36%	42%
Node CPU contention incidents	0	0
Bill change	none	none

The "no change in cost" is the surprising part. People assume removing limits means the workload uses much more capacity. In reality, the workload used nearly the same amount of CPU (average utilization moved from 36% to 42%), just spread across each period as the work arrived instead of bunched up at the start of each period and starved at the end.

Common mistakes

Setting limits.cpu by default on every pod. Causes throttling-induced latency spikes for most latency-sensitive workloads. Set deliberately, not as a habit.
Treating low CPU utilization as evidence of "no contention." A throttled pod sits idle until the next period starts, so throttling itself holds utilization down. The two go together; they are not opposites.
Diagnosing by kubectl top only. kubectl top shows utilization. Throttling is a separate counter. You need to look at the throttling metric explicitly.
Raising the limit by a tiny amount. Going from 500m to 600m rarely fixes throttling. The right move is usually 2-4x or remove entirely.
Forgetting that GC pauses and request handlers can use multiple CPUs at once. A pod with 1 CPU limit and 4 worker threads can burst to 4 CPUs of demand for a few milliseconds, blowing the quota in 25% of the wall-clock time.
Not setting requests.cpu. If you remove limits.cpu, you must keep requests.cpu set. Without it, the scheduler does not know how to bin-pack the pod and you can over-commit nodes.
Ignoring throttling because "the app team has not complained." Throttling that causes 4-second p99 latency hits real users; the app team often does not realize the problem is in the platform layer.
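The multi-threaded burst case from the list above is plain arithmetic, sketched here in Python. The function name is hypothetical, and the model assumes every thread stays busy from the start of the period until the quota runs out.

```python
def quota_burn(limit_cpus, busy_threads, period_ms=100.0):
    """Return (wall-clock ms until the quota is spent, ms throttled afterwards).

    Assumes all busy_threads run flat out from the start of the period.
    """
    quota_ms = limit_cpus * period_ms
    burn_wall_ms = min(period_ms, quota_ms / busy_threads)
    return burn_wall_ms, period_ms - burn_wall_ms


burn, throttled = quota_burn(limit_cpus=1, busy_threads=4)
print(f"quota spent after {burn:.0f} ms, throttled for the remaining {throttled:.0f} ms")
```

With a 1 CPU limit and 4 busy threads, the 100ms quota is gone after 25ms of wall-clock time and the pod waits out the other 75ms. With a single busy thread, the same limit never throttles.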

INTERVIEW QUESTION

A service has 30% average CPU utilization but terrible p99 latency. What's happening and how do you fix it?
