
cgroups, Pod Memory Limits, and What Actually Gets Counted

Your pod's memory limit isn't measuring what you think it is. A tour of cgroup v2 accounting and the surprises hiding inside memory.current.

By Sharon Sahadevan · 8 min read

Most Kubernetes engineers have a mental model of pod memory that's about 80% right. The remaining 20% is what makes you reach for kubectl describe pod at midnight and see a number that doesn't match any of your dashboards.

This post is about that 20%: what memory.current actually contains, why it's different from RSS, why cgroup v1 and v2 disagree about reality, and what to set limits.memory to so you stop being surprised.

The thing your pod's memory limit is actually limiting

When you set resources.limits.memory: 4Gi on a container, the kubelet writes that value to a single file in the cgroup hierarchy. On a cgroup v2 node, that file is:

/sys/fs/cgroup/kubepods.slice/.../memory.max

And the kernel makes a promise: when the cgroup's memory.current would exceed memory.max, it will either reclaim memory from inside the cgroup or invoke the OOM killer.
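A quick sanity check of the unit: the Gi suffix is binary, so 4Gi is 4 × 2³⁰ bytes, and that is exactly the number you'll find in the file:

# Gi means gibibytes (2^30), so resources.limits.memory: 4Gi becomes:
echo $(( 4 * 1024 * 1024 * 1024 ))
# 4294967296  <- the value the kubelet writes into memory.max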

The trap is in the word "memory." The kernel and your dashboards usually disagree about what it means.

What memory.current actually counts

memory.current (cgroup v2) and memory.usage_in_bytes (cgroup v1) include:

  • Anonymous memory: heap, stack, and any MAP_ANONYMOUS mappings. This is what most people picture as "memory used."
  • Page cache attributable to the cgroup: files read or mmap'd by processes in this cgroup. Yes, including page cache pages that are clean and could be evicted instantly.
  • Kernel memory used on behalf of the cgroup (cgroup v2 only by default): slab, kernel stacks, page tables, socket buffers. This was opt-in in v1 via kmem.limit_in_bytes.
  • Shmem: shared memory segments and tmpfs mounts.
  • Swap: separately tracked in memory.swap.current (v2) but counted toward overall memory pressure.

Note what is not included: memory used by other cgroups, even if your processes triggered the kernel to allocate it indirectly. And note what is included that surprises people: every cat /var/log/whatever your container does increases memory.current even if your application's RSS doesn't budge.

KEY CONCEPT

Page cache from your container counts toward memory.current, even though it's reclaimable. If your limits.memory is tight and your container reads a lot of files, you can be OOMKilled despite your application's actual heap being well under the limit. The kernel will try to reclaim page cache first, but only if it has time before the limit is breached.
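A minimal way to watch this happen, assuming a cgroup v2 node where the container sees its own cgroup mounted at /sys/fs/cgroup (the default with cgroup namespaces); the file path is a placeholder:

# From inside the container: baseline
cat /sys/fs/cgroup/memory.current

# Pull a large file through the page cache (placeholder path)
dd if=/path/to/large-file of=/dev/null bs=1M

# memory.current is now higher even though the process heap didn't grow;
# the growth shows up under "file", not "anon"
cat /sys/fs/cgroup/memory.current
grep -E '^(anon|file) ' /sys/fs/cgroup/memory.stat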

Why your Grafana dashboard lies

Your dashboard probably shows container_memory_working_set_bytes (cAdvisor's WSS metric). This is not the same as memory.current. cAdvisor computes it as:

working_set = memory.current - inactive_file

Where inactive_file is the amount of file-backed page cache the kernel considers cold. The intent is to approximate "memory the kernel can't trivially reclaim", closer to what an SRE wants to alert on.
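You can reproduce cAdvisor's number by hand from the same cgroup files (a sketch, run from the container's cgroup directory):

# working set = memory.current minus cold file-backed page cache
CUR=$(cat memory.current)
INACTIVE_FILE=$(awk '/^inactive_file /{print $2}' memory.stat)
echo $(( CUR - INACTIVE_FILE ))   # roughly container_memory_working_set_bytes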

But the kernel's OOM killer doesn't care about working set. It cares about memory.current vs memory.max. So an alert at 80% of working set can stay quiet while the cgroup is already on the verge of being OOMKilled.

The lesson: alert on container_memory_working_set_bytes for early warning, but read container_memory_usage_bytes (which is memory.current directly) when you actually want to know why a pod was killed.
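And when a container has already been killed, the termination reason is recorded in the pod status; a quick check (pod and container names are placeholders):

kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[?(@.name=="<container-name>")].lastState.terminated.reason}'
# OOMKilled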

cgroup v1 vs v2: the differences that bite

If your cluster spans nodes with different cgroup versions (RHEL 8 vs Ubuntu 24.04, for example), the same workload behaves differently on each.

| Behavior | cgroup v1 | cgroup v2 |
| --- | --- | --- |
| Default kernel memory accounting | Off (opt-in via kmem.limit_in_bytes) | On, always |
| OOM killer scope | Per cgroup, picks highest oom_score | Per cgroup, can be configured to kill the whole cgroup atomically (memory.oom.group) |
| PSI (pressure stall info) | Not available | Available at memory.pressure |
| Swap accounting | Combined with memory in memory.memsw.usage_in_bytes | Separate at memory.swap.current |
| Memory limit file | memory.limit_in_bytes | memory.max |
| Soft limit | memory.soft_limit_in_bytes | memory.high (with reclaim throttling, much more useful) |

The most common surprise: a workload that ran fine on a v1 node gets OOMKilled on a v2 node with identical limits, because v2 includes kernel memory in memory.current by default. If your pod uses a lot of sockets, large numbers of FDs, or many threads (each thread has a kernel stack), you can see anywhere from a few MiB to several hundred MiB of additional accounted memory on v2.
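If you want to see how much of that difference is kernel-side, the v2 memory.stat breaks it out directly (the same fields that appear in the sample output later in this post):

# Kernel memory charged to this cgroup: slab, stacks, page tables, sockets
grep -E '^(kernel|kernel_stack|pagetables|slab|sock) ' memory.stat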

WAR STORY

An egress-heavy gRPC service ran fine for two years on cgroup v1 nodes with limits.memory: 1Gi. Migrated to cgroup v2 (Ubuntu 22.04 → 24.04 node pool refresh), and within a day half the pods were in CrashLoopBackOff. The application's RSS hadn't changed. The kernel memory for the ~80,000 sockets it was holding was now being charged to the cgroup. Fix: limits.memory: 1.4Gi. Lesson: v1 → v2 migrations need a memory budget review.

Reading /sys/fs/cgroup like an SRE

When you need ground truth, go to the source. Get a shell on the node (or use a debug pod with nsenter) and find the cgroup path for your container:

# Find the container ID
crictl ps --name my-container

# Get the cgroup path
crictl inspect <container-id> | jq -r '.info.runtimeSpec.linux.cgroupsPath'
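If cgroupsPath comes back in systemd slice:prefix:id form rather than a filesystem path (common when the runtime uses the systemd cgroup driver), a filesystem search by container ID gets you to the same place:

# Fall back to searching the cgroup tree by container ID
find /sys/fs/cgroup/kubepods.slice -type d -name "*<container-id>*"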

The path looks like kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice/cri-containerd-<container-id>.scope. cd into /sys/fs/cgroup/<that path>/ and you have direct access to:

$ cat memory.current
3892412416

$ cat memory.max
4294967296

$ cat memory.events
low 0
high 0
max 12
oom 1
oom_kill 1
oom_group_kill 0

$ cat memory.stat | head -10
anon 2147483648
file 1610612736
kernel 134217728
kernel_stack 16777216
pagetables 8388608
percpu 0
sock 4194304
shmem 0
file_mapped 268435456
file_dirty 0

Two files do most of the work:

  • memory.events: counters for low, high, and max (how many times each threshold or the limit was hit) and oom/oom_kill (whether the killer was invoked and whether it actually killed something). If oom_kill is 1, your container has been OOMKilled at least once during the cgroup's lifetime.
  • memory.stat: the breakdown of where memory is going. anon is your application's heap and stacks, file is page cache, kernel and kernel_stack are the kernel's overhead, sock is socket buffers.
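Since this is cgroup v2, the same directory also exposes PSI via memory.pressure (noted in the table above), which tells you whether tasks are stalling on memory even before any limit is hit. Illustrative output:

$ cat memory.pressure
some avg10=0.00 avg60=0.12 avg300=0.08 total=812345
full avg10=0.00 avg60=0.03 avg300=0.01 total=201239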

This is also exactly the data that powers kubectl debug workflows. The rabbit hole, including how to map /sys/fs/cgroup paths back to specific pods at scale across a cluster, is what we cover in Linux Fundamentals for Engineers.

Setting limits that match reality

A practical procedure for picking memory limits without guessing:

  1. Run the workload under realistic load. Synthetic benchmarks systematically under-report memory because they don't exercise the same code paths.
  2. Read the steady-state memory.current after a soak period (there's a sketch of this after the list). Twenty-four hours minimum, ideally a week. Allocators, caches, and connection pools take time to reach equilibrium.
  3. Add headroom for peak. P99 memory is often 1.5-2× steady state for traffic-driven workloads.
  4. Add 10-20% for the kernel overhead you can't easily measure: slab, kernel stacks (one per OS thread), socket buffers, page tables. This is the headroom that disappears when you forget about it.
  5. Set requests to steady state, limits to peak + overhead. Don't set them equal unless you genuinely want Guaranteed QoS.
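A minimal sketch of steps 2-4, assuming you can read the container's cgroup directory on the node (path is a placeholder): sample memory.current through the soak, keep the peak, and bolt the overhead headroom on at the end.

PEAK=0
for i in $(seq 1 1440); do                                  # one sample per minute for 24h
  CUR=$(cat /sys/fs/cgroup/<cgroup-path>/memory.current)
  (( CUR > PEAK )) && PEAK=$CUR
  sleep 60
done
echo "observed peak:   $PEAK bytes"
echo "suggested limit: $(( PEAK * 120 / 100 )) bytes (peak + 20% kernel overhead)"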

For a long-running JVM, Python, or Node service, that procedure usually lands you within 10% of the right number. For batch workloads with bursty memory profiles, you want requests low and limits generous, because the cluster's bin-packing benefits more from loose requests than tight limits.

The full systematic playbook for memory debugging (App layer, Pod layer, Node layer, all the way out to the cloud provider's hypervisor) is in Kubernetes Debugging for SREs. It pairs well with this post, and it's the curriculum I wish existed when I started running production K8s.

A few traps not covered above

  • Page cache can be reclaimed by the kernel before an OOM kill, but not always in time. On a node under heavy memory pressure, the reclaim path itself runs slowly. You can be OOMKilled with 200 MiB of clean page cache still in memory.current.
  • Sidecar containers share the pod's memory budget at the pod level (limits are per-container, but the pod's overall cgroup limit is the sum). A noisy logging sidecar can push the pod over the line.
  • Init containers do not count toward the running pod's memory (they've already exited), but you should still set sensible limits on them to avoid surprising the scheduler at startup.
  • emptyDir volumes backed by tmpfs count against the pod's memory. If you write a 2 GiB file to a tmpfs emptyDir, that's 2 GiB charged to the pod's cgroup.
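The last point is easy to verify yourself, assuming an emptyDir with medium: Memory mounted at a placeholder path:

# Write 2 GiB into the tmpfs-backed emptyDir (placeholder mount path)
dd if=/dev/zero of=/scratch/blob bs=1M count=2048
# The bytes show up as shmem in the cgroup and count toward memory.current
grep '^shmem ' /sys/fs/cgroup/memory.stat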

If this kind of detail is your jam, the Kubenatives newsletter ships a similar post most weeks, usually one production gotcha and a deep-dive on the kernel or Kubernetes internal that explains it. ~3,500 engineers read it; you can subscribe at the link.

The next time OOMKilled shows up in your alerts, skip the reflex of bumping the limit. Pull memory.stat, look at where the bytes are actually going, and fix the right thing.