Your Pod Is Using 5% CPU and Still Throttled. Here Is Why.
CPU limits in Kubernetes do not mean what you think they mean. A tour of CFS quota, the 100ms scheduling period, and why your latency spikes look nothing like CPU saturation.
Your pod's Grafana dashboard shows 5% average CPU usage. Latency p99 is twice what it should be. The on-call engineer asks "is the pod CPU-bound?" and you say no, look, it is barely using any CPU.
Then someone runs:
kubectl exec -it $POD -- cat /sys/fs/cgroup/cpu.stat | grep throttled
And sees:
nr_throttled 4823
throttled_usec 142000000
The pod has been throttled 4,823 times. It has spent 142 seconds being held off the CPU on purpose by the kernel. The dashboard says 5% utilization. Both are correct.
This is one of the most counterintuitive performance traps in Kubernetes. Your CPU limits do not mean "do not exceed this much CPU on average." They mean something stricter and more surprising. This post is what they actually mean, why your low-utilization pod is still throttled, and how to set limits so this stops happening.
What resources.limits.cpu actually does#
When you set this on a container:
resources:
  limits:
    cpu: "1"
You are telling the kubelet to enforce a CFS (Completely Fair Scheduler) quota on the pod's cgroup. The kubelet writes the quota and period into the cgroup filesystem:
/sys/fs/cgroup/cpu.max (cgroup v2)
# or
/sys/fs/cgroup/cpu/cpu.cfs_quota_us (cgroup v1)
/sys/fs/cgroup/cpu/cpu.cfs_period_us (cgroup v1)
cpu.max (or the v1 quota+period pair) defines two values: a quota and a period. The default period is 100ms. The quota for cpu: "1" is 100ms. For cpu: "500m" it is 50ms. For cpu: "2" it is 200ms.
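The files store microseconds, not milliseconds. For limits.cpu: "1" on a cgroup v2 node, you would expect something like this (the exact cgroup path varies by runtime and cgroup driver):
cat /sys/fs/cgroup/cpu.max
100000 100000   # quota_us period_us: 100ms of CPU time per 100ms period
# cgroup v1 splits the same pair across two files
cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
100000
cat /sys/fs/cgroup/cpu/cpu.cfs_period_us
100000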
The contract is: in each 100ms period, the pod can use at most its quota of CPU time (100ms for cpu: "1"), summed across all cores its threads run on.
That is a very different statement from "the pod can use 1 CPU on average." It is a per-window cap, not an average.
Why throttling happens at low average utilization#
Imagine your pod is a Java service that spends most of its time idle but periodically does a small burst of work: handle an incoming HTTP request, do some computation, return a response.
A typical request takes 30ms of CPU. The service handles 5 requests per second. Average CPU usage is 5 * 30ms / 1000ms = 15%. Easy.
Now imagine those 5 requests arrive in the same 50ms window (a thundering herd from a downstream cron, a load balancer reconnect, a cache miss storm). Five requests, each needing 30ms of CPU, want to run at the same time. On a node with enough idle cores, the pod could run all five in parallel and finish in 30ms of wall-clock time, burning 5 * 30ms = 150ms of total CPU time in that window.
But your pod has limits.cpu: "1", which means a quota of 100ms per 100ms period. With five threads running in parallel, the quota burns five times faster than wall clock and is gone after 20ms. Once it is exhausted, the kernel hard-stops every thread in the pod and refuses to schedule them again until the next period starts. The pod is throttled.
The unfinished requests sit there for the remaining ~80ms of the period, waiting for the quota to refill. Those are 80ms of latency that have nothing to do with code complexity, GC pauses, or downstream dependencies. They are a direct artifact of the CFS quota.
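Assuming the burst lands right at the start of a period, the stall looks like this:
t=0ms        all 5 requests start in parallel (5 threads on 5 cores)
t=20ms       5 threads x 20ms = 100ms quota exhausted; every thread frozen
t=20-100ms   pod throttled; each request still has ~10ms of CPU work left
t=100ms      new period, quota refills; the leftover work finishes by ~110ms
Every request in the burst sees ~110ms of latency instead of ~30ms.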
Average CPU usage across the second? 15%. Throttling? Yes, painfully.
This is the trap. CPU limits enforce a peak budget, not an average. A workload that bursts above the limit in any 100ms window gets throttled, even if average usage is nowhere near the limit.
How to detect this#
Three signals to look for:
1. Throttling counter on the pod's cgroup. This is the ground truth.
# cgroup v2 (most modern distros)
kubectl exec -it $POD -- cat /sys/fs/cgroup/cpu.stat
# Look for nr_throttled (count) and throttled_usec (microseconds throttled)
# cgroup v1
kubectl exec -it $POD -- cat /sys/fs/cgroup/cpu/cpu.stat
# nr_throttled and throttled_time (nanoseconds)
If nr_throttled is increasing while the workload is running, you are being throttled.
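A quick way to confirm it is happening right now is to sample the counter twice (this assumes the container image ships a shell and standard utilities):
# two samples, 10 seconds apart; a growing count means active throttling
kubectl exec $POD -- sh -c \
  'grep nr_throttled /sys/fs/cgroup/cpu.stat; sleep 10; grep nr_throttled /sys/fs/cgroup/cpu.stat'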
2. Prometheus metric container_cpu_cfs_throttled_seconds_total. Exposed by cAdvisor on every node.
# Per-pod throttled rate (seconds of throttling per second of wall clock)
rate(container_cpu_cfs_throttled_seconds_total{pod="$pod"}[5m])
# Throttled fraction of periods (closer to 1.0 means more pain)
rate(container_cpu_cfs_throttled_periods_total{pod="$pod"}[5m])
/
rate(container_cpu_cfs_periods_total{pod="$pod"}[5m])
A throttled fraction above ~0.05 (5% of periods experiencing throttling) is a real problem in latency-sensitive services. Above 0.5 is severe.
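If you want this watched continuously, the ratio translates directly into an alerting rule. A minimal sketch (the rule name, threshold, and duration are placeholders to adjust; kube-prometheus ships a similar CPUThrottlingHigh alert out of the box):
groups:
  - name: cpu-throttling
    rules:
      - alert: CPUThrottlingHigh
        expr: |
          rate(container_cpu_cfs_throttled_periods_total[5m])
            / rate(container_cpu_cfs_periods_total[5m]) > 0.05
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.pod }} throttled in more than 5% of CFS periods"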
3. Latency p99 that does not correlate with average utilization. This is the symptom that brings you to the diagnosis. CPU dashboards look fine, but tail latency is bad. The shape on a histogram is distinctive: a clean fast peak plus a smaller slow peak around 50-100ms (one CFS period of forced wait).
How to fix it#
The right fix depends on the workload. There is no single answer, but here are the four real options.
Option 1: Remove the CPU limit entirely#
The most aggressive fix and, for many workloads, the right one. Without a limits.cpu, the pod can use as much CPU as is available on the node. requests.cpu is not a cap: it still drives scheduling and sets the pod's relative CFS weight when the node is contended, but on an idle node the pod can burst freely.
resources:
  requests:
    cpu: "500m"    # for scheduling and fair-share
  # no limits.cpu: bursting allowed
This is the recommendation from many SREs at scale: set requests to the average expected use, leave limits unset, let the kernel's CFS shares handle contention. The downside: a single misbehaving pod can starve neighbors on the same node. The upside: no artificial latency.
Whether this works for you depends on your tenancy model. Single-team clusters: probably fine. Multi-tenant clusters where tenants distrust each other: not safe.
Option 2: Raise the CPU limit to the burst peak, not the average#
If you must keep limits (multi-tenant, regulated, or just policy), set them based on burst behavior, not average. If your workload bursts to 4 CPU-equivalents for 30ms during request handling, you need limits.cpu: "4" or higher, even though average usage is 15%. The limit must accommodate the burst, not the average.
resources:
  requests:
    cpu: "500m"   # average use
  limits:
    cpu: "4"      # peak burst
This is wasteful from a quota-accounting perspective but fixes the throttling. It works because the quota now equals 400ms per 100ms period (4 cores worth), and the burst easily fits.
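You can sanity-check what the kubelet wrote by reading the container's cgroup (cgroup v2, values in microseconds):
cat /sys/fs/cgroup/cpu.max
400000 100000   # 400ms of quota per 100ms period = 4 cores' worth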
Option 3: Tune the CFS period#
Linux lets you change the period from the default 100ms to something shorter. With a shorter period, throttling still happens, but each stall is shorter, so latency tails shrink. Kubernetes does not expose this per pod, but the kubelet's --cpu-cfs-quota-period flag changes it for every container on that node (set it on all kubelets for a cluster-wide effect). A 10ms period instead of 100ms makes each throttling wait 10x shorter.
Trade-off: more bookkeeping overhead and finer-grained scheduling decisions.
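If you go this route, the flag maps to the cpuCFSQuotaPeriod field in the kubelet config. A sketch, assuming a Kubernetes version where the field is still behind the CustomCPUCFSQuotaPeriod feature gate:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  CustomCPUCFSQuotaPeriod: true
cpuCFSQuotaPeriod: "10ms"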
Option 4: Disable CFS quota enforcement on a node pool#
The kubelet flag --cpu-cfs-quota=false disables quota enforcement entirely on that node. Limits become advisory only. Useful for dedicated node pools running latency-sensitive workloads, where you trust the workloads not to monopolize the node.
# In the kubelet config or kube-flags
cpuCFSQuota: false
Effectively this gives every pod on that node behavior similar to "no limits set," with the difference that you can still see what limit was nominally requested (for accounting and quota tracking).
The Java problem (and other multi-threaded runtimes)#
If your workload is a JVM, Go, Node.js, or any runtime that spawns multiple threads, throttling has an extra cruelty: every thread's CPU time counts against the same quota.
A Java service with 8 worker threads, all doing 30ms of work in parallel, uses 8 * 30ms = 240ms of CPU time in 30ms of wall clock. Your limits.cpu: "1" (100ms quota) is exhausted after just 12.5ms. The remaining work is throttled for the other ~87ms of the period.
The runtime does not know about CFS quota. The JVM happily uses Runtime.availableProcessors() to decide how many threads to spawn (defaults to the node's core count, often 8, 16, or 32). It then runs them all in parallel and gets throttled because the cgroup quota is far smaller.
The historical fix for Java: -XX:ActiveProcessorCount=N to lie to the JVM about how many cores it has. Modern JVMs (8u192+, 11+) detect cgroup limits automatically and adjust thread pools, so this is less of a problem now, but the underlying issue remains for any runtime that does not honor cgroups.
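If you are on an older JVM, or would rather pin the value than trust autodetection, one way is to pass the flag through the environment. A sketch assuming limits.cpu: "1" (JAVA_TOOL_OPTIONS is read by any HotSpot JVM at startup):
env:
  - name: JAVA_TOOL_OPTIONS
    value: "-XX:ActiveProcessorCount=1"   # match limits.cpu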
For Go, set GOMAXPROCS to match the limit:
env:
  - name: GOMAXPROCS
    value: "1"   # if limits.cpu is "1"
The automaxprocs library from Uber (go.uber.org/automaxprocs) sets this from the cgroup quota automatically at startup.
For Node.js, the runtime is single-threaded for JavaScript code, so this is mostly fine, but worker threads and the libuv thread pool can still trigger it.
When CPU limits make sense#
I have been negative about CPU limits, but they are not always bad. They make sense when:
- You run untrusted workloads that could deliberately starve neighbors. A multi-tenant cluster with tenants that do not trust each other needs limits. Your home cluster does not.
- You charge by CPU usage and need predictable billing. Limits prevent surprise bills.
- The workload is genuinely CPU-bound and predictable. A batch processor that wants 4 cores constantly and never bursts higher: limits at "4" hurt nothing.
- You need protection against runaway processes. A buggy workload that goes into an infinite loop can be capped.
If none of these apply, consider going limit-less.
Quick reference: the throttling diagnostic checklist#
When latency is bad and CPU dashboards look fine:
1. Check throttling on the pod cgroup:
kubectl exec -it $POD -- cat /sys/fs/cgroup/cpu.stat | grep throttled
2. Plot throttled fraction in Prometheus:
rate(container_cpu_cfs_throttled_periods_total[5m])
/ rate(container_cpu_cfs_periods_total[5m])
(above 5% = problem)
3. Inspect the limit vs the workload's burst pattern (a measurement sketch follows this checklist):
- How much CPU time does a single request cost?
- How many requests are in flight concurrently?
- Multiply the two: that is the CPU time demanded per 100ms window
- The limit must accommodate that demand, not the average
4. Pick a fix:
- Trusted workload, single tenant: remove the limit
- Multi-tenant: raise the limit to peak burst (not average)
- Latency-critical at scale: shorten the CFS period or disable quota on dedicated nodes
- JVM/Go: ensure runtime is honoring cgroups (GOMAXPROCS, ActiveProcessorCount)
5. Re-measure throttling after the change.
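For step 3, a crude way to see the burst pattern is to sample the cgroup's cumulative usage counter in ~100ms windows and compare each delta against the quota. A rough sketch to run inside the container (cgroup v2 path; assumes a POSIX shell with awk, seq, and a sleep that accepts fractions):
# prints CPU usage per ~100ms window; compare each delta to the quota
# (100000 usec per window for limits.cpu: "1")
prev=$(awk '/usage_usec/ {print $2}' /sys/fs/cgroup/cpu.stat)
for i in $(seq 1 50); do
  sleep 0.1
  cur=$(awk '/usage_usec/ {print $2}' /sys/fs/cgroup/cpu.stat)
  echo "window $i: $((cur - prev)) usec of CPU"
  prev=$cur
done
Windows whose delta approaches the quota are the ones the kernel throttles.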
The goal is not zero throttling at all costs. The goal is throttling that does not show up in your latency tail.
The mental model that fixes this#
Stop thinking of limits.cpu as "average CPU cap." Start thinking of it as "peak burst CPU cap, enforced every 100ms." Almost everything else follows from that one shift.
Average CPU is what your dashboard graphs measure. CFS quota is what the kernel enforces. They are different things. Your latency cares about the kernel's view, not the dashboard's.
This is one of dozens of performance traps covered in the Kubernetes Performance Optimization course, where we go through CPU, memory, network, and storage performance from first principles. And in the Kubernetes Debugging course, we cover how to diagnose latency mysteries like this one in production.