cgroups, Pod Memory Limits, and What Actually Gets Counted
Your pod's memory limit isn't measuring what you think it is. A tour of cgroup v2 accounting and the surprises hiding inside memory.current.
Most Kubernetes engineers have a mental model of pod memory that's about 80% right. The remaining 20% is what makes you reach for kubectl describe pod at midnight and see a number that doesn't match any of your dashboards.
This post is about that 20%. What memory.current actually contains, why it's different from RSS, why cgroup v1 and v2 disagree about reality, and what to set as a limits.memory so you stop being surprised.
The thing your pod's memory limit is actually limiting#
When you set resources.limits.memory: 4Gi on a container, the kubelet writes that value to a single file in the cgroup hierarchy. On a cgroup v2 node, that file is:
/sys/fs/cgroup/kubepods.slice/.../memory.max
And the kernel makes a promise: when the cgroup's memory.current would exceed memory.max, it will either reclaim memory from inside the cgroup or invoke the OOM killer.
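For concreteness, here's roughly the conversion between the resource quantity you write in the pod spec and the byte value that lands in memory.max. The `to_bytes` helper is mine, not a kubelet API; it's just the binary-suffix arithmetic:

```shell
#!/usr/bin/env bash
# Illustration only: convert a Kubernetes binary-suffix memory quantity
# (e.g. "4Gi") to the byte value you'd see in the cgroup's memory.max.
# to_bytes is a hypothetical helper, not part of any Kubernetes tooling.
to_bytes() {
  local q=$1
  case $q in
    *Ki) echo $(( ${q%Ki} * 1024 )) ;;
    *Mi) echo $(( ${q%Mi} * 1024 * 1024 )) ;;
    *Gi) echo $(( ${q%Gi} * 1024 * 1024 * 1024 )) ;;
    *)   echo "$q" ;;   # already plain bytes
  esac
}

to_bytes 4Gi    # -> 4294967296, the number you'd read back from memory.max
```

Reading memory.max on the node and comparing it to this arithmetic is a quick sanity check that the limit you set is the limit the kernel is enforcing.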
The trap is in the word "memory." The kernel and your dashboards usually disagree about what it means.
What memory.current actually counts#
memory.current (cgroup v2) and memory.usage_in_bytes (cgroup v1) include:
- Anonymous memory: heap, stack, and any MAP_ANONYMOUS mappings. This is what most people picture as "memory used."
- Page cache attributable to the cgroup: files read or mmap'd by processes in this cgroup. Yes, including page cache pages that are clean and could be evicted instantly.
- Kernel memory used on behalf of the cgroup (cgroup v2 only by default): slab, kernel stacks, page tables, socket buffers. This was opt-in in v1 via kmem.limit_in_bytes.
- Shmem: shared memory segments and tmpfs mounts.
- Swap: tracked separately in memory.swap.current (v2) but counted toward overall memory pressure.
Note what is not included: memory used by other cgroups, even if your processes triggered the kernel to allocate it indirectly. And note what is included that surprises people: every cat /var/log/whatever your container does increases memory.current even if your application's RSS doesn't budge.
Page cache from your container counts toward memory.current, even though it's reclaimable. If your limits.memory is tight and your container reads a lot of files, you can be OOMKilled despite your application's actual heap being well under the limit. The kernel attempts to reclaim page cache first, but if direct reclaim can't free enough memory fast enough, the OOM killer fires anyway.
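A back-of-the-envelope sketch of that failure mode, with made-up numbers for a container limited to 1Gi:

```shell
# Hypothetical numbers for a container with limits.memory: 1Gi.
# The heap is comfortably under the limit, but heap + page cache is not.
limit=$(( 1024 * 1024 * 1024 ))   # memory.max: 1 GiB
anon=$((  800 * 1024 * 1024 ))    # application heap (anon in memory.stat)
file=$((  300 * 1024 * 1024 ))    # page cache from reading files

current=$(( anon + file ))        # what the kernel charges as memory.current
echo "headroom: $(( (limit - current) / 1024 / 1024 )) MiB"
# -> headroom: -76 MiB: the charge exceeds memory.max, so the kernel
# must reclaim page cache immediately or invoke the OOM killer.
```

Your dashboard's RSS panel would show 800 MiB and look perfectly healthy while this happens.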
Why your Grafana dashboard lies#
Your dashboard probably shows container_memory_working_set_bytes (cAdvisor's WSS metric). This is not the same as memory.current. cAdvisor computes it as:
working_set = memory.current - inactive_file
Where inactive_file is the amount of file-backed page cache the kernel considers cold. The intent is to approximate "memory the kernel can't trivially reclaim", closer to what an SRE wants to alert on.
But the kernel's OOM killer doesn't care about working set. It compares memory.current against memory.max. An alert at 80% of working set can stay quiet right up until the cgroup is OOMKilled, because page cache pads the gap between the two numbers.
The lesson: alert on container_memory_working_set_bytes for early warning, but read container_memory_usage_bytes (which is memory.current directly) when you actually want to know why a pod was killed.
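You can reproduce cAdvisor's subtraction yourself with made-up numbers (the clamp at zero matches what cAdvisor does when inactive_file exceeds usage):

```shell
# Made-up sample values, in bytes, for one container's cgroup.
memory_current=3892412416    # reported as container_memory_usage_bytes
inactive_file=1610612736     # cold file-backed page cache

# cAdvisor's working-set computation, clamped at zero:
working_set=$(( memory_current - inactive_file ))
(( working_set < 0 )) && working_set=0

echo "usage:       $memory_current"
echo "working set: $working_set"
```

The ~1.5 GiB gap between the two numbers is exactly the cold page cache your working-set alert never sees.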
cgroup v1 vs v2: the differences that bite#
If your cluster spans nodes with different cgroup versions (RHEL 8 vs Ubuntu 24.04, for example), the same workload behaves differently on each.
| Behavior | cgroup v1 | cgroup v2 |
|---|---|---|
| Default kernel memory accounting | Off (opt-in via kmem.limit_in_bytes) | On, always |
| OOM killer scope | Per cgroup, picks highest oom_score | Per cgroup, can be configured to kill the whole cgroup atomically (memory.oom.group) |
| PSI (pressure stall info) | Not available | Available at memory.pressure |
| Swap accounting | Combined with memory in memory.memsw.usage_in_bytes | Separate at memory.swap.current |
| Memory limit file | memory.limit_in_bytes | memory.max |
| Soft limit | memory.soft_limit_in_bytes | memory.high (with reclaim throttling, much more useful) |
The most common surprise: a workload that ran fine on a v1 node OOMKills on a v2 node with identical limits, because v2 includes kernel memory in memory.current by default. If your pod uses a lot of sockets, large numbers of FDs, or many threads (each thread has a kernel stack), you can see anywhere from a few MiB to several hundred MiB of additional accounted memory on v2.
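A quick estimate of just one of those contributors, kernel stacks. The 16 KiB per-thread figure is the usual x86-64 kernel stack size; treat it as an assumption, and check your own workload's kernel_stack line in memory.stat:

```shell
# Back-of-the-envelope: kernel stacks alone for a thread-heavy service.
# Assumes the common x86-64 kernel stack size of 16 KiB per thread.
threads=2000
stack_kib=16
kstack_mib=$(( threads * stack_kib / 1024 ))
echo "kernel_stack: ~${kstack_mib} MiB"
# -> ~31 MiB: invisible to the limit under v1 defaults, charged on v2
```

Sockets, slab, and page tables stack on top of that, which is how the gap reaches hundreds of MiB for the worst offenders.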
An egress-heavy gRPC service ran fine for two years on cgroup v1 nodes with limits.memory: 1Gi. Migrated to cgroup v2 (Ubuntu 22.04 → 24.04 node pool refresh), and within a day half the pods were in CrashLoopBackOff. The application's RSS hadn't changed. The kernel memory for the ~80,000 sockets it was holding was now being charged to the cgroup. Fix: limits.memory: 1.4Gi. Lesson: v1 → v2 migrations need a memory budget review.
Reading /sys/fs/cgroup like an SRE#
When you need ground truth, go to the source. Get a shell on the node (or use a debug pod with nsenter) and find the cgroup path for your container:
# Find the container ID
crictl ps --name my-container
# Get the cgroup path
crictl inspect <container-id> | jq -r '.info.runtimeSpec.linux.cgroupsPath'
The path looks like kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice/cri-containerd-<container-id>.scope. cd into /sys/fs/cgroup/<that path>/ and you have direct access to:
$ cat memory.current
3892412416
$ cat memory.max
4294967296
$ cat memory.events
low 0
high 0
max 12
oom 1
oom_kill 1
oom_group_kill 0
$ cat memory.stat | head -10
anon 2147483648
file 1610612736
kernel 134217728
kernel_stack 16777216
pagetables 8388608
percpu 0
sock 4194304
shmem 0
file_mapped 268435456
file_dirty 0
Two files do most of the work:
- memory.events: counters for low, high, and max (memory pressure thresholds tripped) and oom/oom_kill (whether the killer fired and whether it actually killed something). If oom_kill is nonzero, your container has been OOMKilled at least once during its lifetime.
- memory.stat: the breakdown of where memory is going. anon is your application's heap, file is page cache, kernel and kernel_stack are the kernel's overhead, sock is socket buffers.
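The first check I run is a one-liner against memory.events. A minimal sketch, with sample content standing in for the real file (on a node, replace the here-variable with `cat memory.events`):

```shell
# Minimal sketch: decide from memory.events whether this cgroup has
# ever been OOM-killed. Sample contents stand in for the real file.
events='low 0
high 0
max 12
oom 1
oom_kill 1
oom_group_kill 0'

oom_kills=$(echo "$events" | awk '$1 == "oom_kill" { print $2 }')
if [ "$oom_kills" -gt 0 ]; then
  echo "OOM killer has fired $oom_kills time(s) in this cgroup"
fi
```

The max counter is worth reading too: 12 limit breaches with only one kill means reclaim saved you eleven times.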
This is also exactly the data that powers kubectl debug workflows. The rabbit hole, including how to map /sys/fs/cgroup paths back to specific pods at scale across a cluster, is what we cover in Linux Fundamentals for Engineers.
Setting limits that match reality#
A practical procedure for picking memory limits without guessing:
- Run the workload under realistic load. Synthetic benchmarks systematically under-report memory because they don't exercise the same code paths.
- Read the steady-state memory.current after a soak period. Twenty-four hours minimum, ideally a week. Allocators, caches, and connection pools take time to reach equilibrium.
- Add headroom for peak. P99 memory is often 1.5-2× steady state for traffic-driven workloads.
- Add 10-20% for the kernel overhead you can't easily measure: slab, kernel stacks (one per OS thread), socket buffers, page tables. This is the headroom that disappears when you forget about it.
- Set requests to steady state, limits to peak + overhead. Don't set them equal unless you genuinely want Guaranteed QoS.
For a long-running JVM, Python, or Node service, that procedure usually lands you within 10% of the right number. For batch workloads with bursty memory profiles, you want requests low and limits generous, because the cluster's bin-packing benefits more from loose requests than tight limits.
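The arithmetic behind that procedure, with illustrative numbers (every value here is an assumption to replace with your own measurements):

```shell
# Illustrative numbers; plug in your own soak-test measurements.
steady_mib=1200        # steady-state memory.current after a week-long soak
peak_factor_pct=175    # observed P99 is ~1.75x steady state
kernel_pct=15          # headroom for slab, kernel stacks, socket buffers

request_mib=$steady_mib
peak_mib=$(( steady_mib * peak_factor_pct / 100 ))
limit_mib=$(( peak_mib * (100 + kernel_pct) / 100 ))

echo "requests.memory: ${request_mib}Mi"   # 1200Mi
echo "limits.memory:   ${limit_mib}Mi"     # 2415Mi
```

Round the result up to something a human can read in a pod spec; nobody wants to reason about 2415Mi at 3 a.m. when 2.5Gi says the same thing.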
The full systematic playbook for memory debugging: App layer, Pod layer, Node layer, all the way out to the cloud provider's hypervisor, is in Kubernetes Debugging for SREs. It pairs well with this post, and it's the curriculum I wish existed when I started running production K8s.
A few traps not covered above#
- Page cache can be reclaimed by the kernel before OOMKill, but not always in time. On a node under heavy memory pressure, the reclaim path itself runs slowly. You can OOMKill with 200 MiB of clean page cache still in memory.current.
- Sidecar containers share the pod's memory budget at the pod level (limits is per-container, but the pod's overall cgroup is the sum). A noisy logging sidecar can push the pod over the line.
- Init containers do not count toward the running pod's memory (they've already exited), but you should still set sensible limits on them to avoid surprising the scheduler at startup.
- emptyDir volumes backed by tmpfs count against the pod's memory. If you write a 2 GiB file to a tmpfs emptyDir, that's 2 GiB charged to the pod's cgroup.
If this kind of detail is your jam, the Kubenatives newsletter ships a similar post most weeks, usually one production gotcha and a deep-dive on the kernel or Kubernetes internal that explains it. ~3,500 engineers read it; you can subscribe at the link.
The next time OOMKilled shows up in your alerts, skip the reflex of bumping the limit. Pull memory.stat, look at where the bytes are actually going, and fix the right thing.