cgroups v1 vs v2
A Kubernetes pod in production has its memory limit set to 4 GiB. The app inside sees 64 GiB via `/proc/meminfo` (the host's memory), cheerfully allocates 6 GiB of Python objects, and gets OOM-killed with exit code 137. The pod restarts, runs fine for a minute, then OOMs again. An engineer spends an afternoon convinced that the memory limit is being enforced incorrectly, that there is a memory leak, that the JVM is misbehaving — before noticing that the app's `--max-memory` flag is computed from `/proc/meminfo` and has no idea about the container's cgroup limit.
If namespaces tell a process what it can see, cgroups tell the kernel what it can use. Every resource limit you set on a container, every `MemoryMax=` in a systemd unit, every `cpu.shares` value in a Kubernetes QoS class — all of them are cgroup controls. This lesson covers what cgroups are, the difference between v1 and v2 (virtually every Linux system from ~2022 onward runs v2), and the controllers you will actually tune: CPU, memory, and I/O.
What cgroups Do
A cgroup (control group) is a kernel-managed group of processes to which resource controls apply. The kernel tracks each cgroup's usage, enforces limits, and provides accounting.
The unit of management is a directory in a special filesystem. Create a subdirectory, echo a process PID into cgroup.procs, and that process is now a member of that cgroup. Write a value into memory.max — that is now the memory limit for every process in the cgroup. The kernel does the rest.
cgroups control:
- CPU — how much CPU time processes in the cgroup can use, and at what priority.
- Memory — hard and soft memory limits, swap limits, OOM behavior.
- I/O — disk bandwidth and IOPS limits per block device.
- PIDs — how many processes/threads the cgroup can create.
- Devices — which device nodes the cgroup can read/write.
- Network classifiers — tagging traffic for QoS (used with `tc`).
- RDMA, hugetlb, misc — specialized controllers.
Plus accounting: even without limits, cgroups count bytes read, CPU time used, etc.
A cgroup is the answer to "how much of each resource is this set of processes using, and how much are they allowed?" Namespaces isolate what processes see; cgroups limit what they can do. Containers, systemd services, and Kubernetes pods all live in cgroups — the kernel uses the cgroup hierarchy to enforce limits and to report usage back up to your monitoring.
v1 vs v2: Why There Are Two
cgroups v1 (Linux 2.6.24, 2008): each controller has its own hierarchy. You could have processes in different memory/CPU cgroups, independently. This was flexible but became unwieldy — tools needed to coordinate across multiple hierarchies, controllers often disagreed about which group a process was really in.
cgroups v2 (Linux 4.5, 2016; stable ~2018; default on almost every modern distro): a single unified hierarchy. Every process is in exactly one cgroup at a time, and that cgroup has all enabled controllers. Simpler, more consistent, better integrated with systemd.
How to tell which you are on:
# The filesystem under /sys/fs/cgroup tells you
mount | grep cgroup
# v2: cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
# v1: cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,... cpu,cpuacct)
# cgroup on /sys/fs/cgroup/memory type cgroup (rw,... memory)
# ...many mounts
# Hybrid: some v1 + /sys/fs/cgroup/unified on cgroup2
# Or look at the structure
ls /sys/fs/cgroup | head
# v2 single file tree, files like cgroup.controllers, cpu.max, memory.max, etc.
# v1 multiple directories like cpu/, memory/, blkio/, each with their own files
# The authoritative check
stat -fc %T /sys/fs/cgroup/
# cgroup2fs (pure v2)
# tmpfs (v1 or hybrid — look at contents)
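The same classification can be scripted. A minimal sketch (the function name and sample lines are illustrative, not a standard tool) that classifies a host from its `mount` output:

```python
def cgroup_version(mount_lines):
    """Classify the cgroup setup from `mount` output, one line per entry."""
    has_v2 = any(" type cgroup2 " in line for line in mount_lines)
    has_v1 = any(" type cgroup " in line for line in mount_lines)
    if has_v2 and has_v1:
        return "hybrid"   # v1 controllers plus a cgroup2 "unified" mount
    if has_v2:
        return "v2"
    if has_v1:
        return "v1"
    return "none"

# Sample lines as they appear on a pure-v2 host and a v1 host
v2_host = ["cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)"]
v1_host = [
    "cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,cpu,cpuacct)",
    "cgroup on /sys/fs/cgroup/memory type cgroup (rw,memory)",
]
print(cgroup_version(v2_host))  # v2
print(cgroup_version(v1_host))  # v1
```

In practice you would feed it the output of `mount` directly, e.g. `cgroup_version(subprocess.check_output(["mount"], text=True).splitlines())`.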
This lesson uses v2 as the default — it is what you will encounter on Ubuntu 22.04+, Debian 11+, Fedora, RHEL 9+, Arch, and every current cloud Linux distribution. Where v1 differs significantly, we will call it out.
The v2 Hierarchy
Under /sys/fs/cgroup/, each directory is a cgroup. Directories can contain other directories (subcgroups). Every process is pinned to exactly one cgroup at any moment.
# Top-level layout on a systemd system
ls /sys/fs/cgroup
# cgroup.controllers <- which controllers are available at root
# cgroup.procs <- PIDs in the root cgroup
# cgroup.subtree_control <- which controllers are enabled in children
# cpu.max, memory.max, io.max, ... <- root-level limits (usually none)
# init.scope/ <- PID 1 (systemd itself)
# system.slice/ <- system services live here
# user.slice/ <- user sessions live here
# Explore a specific service's cgroup
ls /sys/fs/cgroup/system.slice/nginx.service/
# cgroup.controllers
# cgroup.events
# cgroup.procs <- every PID in this cgroup
# cgroup.stat
# cpu.max <- CPU limit (if any)
# cpu.pressure
# cpu.stat
# cpu.weight
# io.stat
# io.max
# memory.max
# memory.min
# memory.current
# memory.events
# memory.pressure
# pids.max
# pids.current
# ...
# Every PID in this cgroup
cat /sys/fs/cgroup/system.slice/nginx.service/cgroup.procs
# 12302
# 12303
# 12304
# ...
The hierarchy mirrors systemd's structure: slices (groups of units — system.slice, user.slice), scopes (externally created processes that systemd adopts), and services (one per .service unit).
Which cgroup is a process in?
cat /proc/$PID/cgroup
# 0::/system.slice/nginx.service
# or inside a container:
# 0::/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod....scope/cri-containerd-...scope
On v2 this is always a single line 0::/PATH because there is one unified hierarchy.
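Parsing that file is a one-liner; a hypothetical helper (not part of any standard library) that handles both the v2 format (`0::/path`) and the v1 format (`id:controllers:/path`):

```python
def parse_proc_cgroup(text):
    """Parse /proc/<pid>/cgroup into (hierarchy_id, controllers, path) tuples."""
    entries = []
    for line in text.strip().splitlines():
        hier_id, controllers, path = line.split(":", 2)
        entries.append((int(hier_id), controllers.split(",") if controllers else [], path))
    return entries

# v2: always a single 0::/path entry, empty controller list
print(parse_proc_cgroup("0::/system.slice/nginx.service"))
# [(0, [], '/system.slice/nginx.service')]
```

On a v1 host the same file has one line per hierarchy, e.g. `4:memory:/mygroup`, and the helper returns one tuple per line.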
CPU Controller
The CPU controller limits and prioritizes CPU time.
cpu.max — hard quota and period
Format: `<quota> <period>`. Every `period` microseconds, the cgroup can use up to `quota` microseconds of CPU time.
# Limit the cgroup to 1 full CPU (100ms quota every 100ms)
echo "100000 100000" | sudo tee /sys/fs/cgroup/myservice/cpu.max
# 0.5 CPUs
echo "50000 100000" | sudo tee /sys/fs/cgroup/myservice/cpu.max
# 2 CPUs
echo "200000 100000" | sudo tee /sys/fs/cgroup/myservice/cpu.max
# Remove the limit
echo "max 100000" | sudo tee /sys/fs/cgroup/myservice/cpu.max
This is CPUQuota= in systemd and resources.limits.cpu in Kubernetes.
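The quota arithmetic is simple enough to wrap in a helper. A sketch (the function names are made up for illustration) converting a CPU count to the `cpu.max` string and back:

```python
def cpus_to_cpu_max(cpus, period_us=100_000):
    """Render a CPU count as the '<quota> <period>' string cpu.max expects."""
    if cpus is None:  # no limit
        return f"max {period_us}"
    return f"{int(cpus * period_us)} {period_us}"

def cpu_max_to_cpus(value):
    """Inverse: parse a cpu.max value into a CPU count (None = unlimited)."""
    quota, period = value.split()
    return None if quota == "max" else int(quota) / int(period)

print(cpus_to_cpu_max(0.5))              # 50000 100000
print(cpu_max_to_cpus("200000 100000"))  # 2.0
```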
cpu.weight — relative priority
Range 1–10000, default 100. Under contention, cgroups are scheduled proportionally to their weight: a cgroup with weight 200 gets twice the CPU of one with weight 100 — but only when both are asking for CPU at the same time. When there is no contention, every cgroup gets what it asks for.
# Make this service high priority
echo 500 | sudo tee /sys/fs/cgroup/high-priority-job/cpu.weight
# Low priority background work
echo 10 | sudo tee /sys/fs/cgroup/cleanup.service/cpu.weight
This is CPUWeight= in systemd and resources.requests.cpu in Kubernetes (Kubernetes converts CPU requests into cgroup weights).
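The request-to-weight conversion is worth seeing once. A sketch of the arithmetic runc applies on cgroup v2 (requests become v1-style shares, then map linearly onto the weight range; verify against your runtime's source before relying on exact values):

```python
def millicores_to_cpu_weight(millicores):
    """Approximate the runc/Kubernetes conversion from CPU request to cpu.weight.

    A request becomes v1-style shares (1024 per CPU), and runc then maps the
    shares range [2, 262144] linearly onto the weight range [1, 10000].
    """
    shares = millicores * 1024 // 1000
    return int((shares - 2) * 9999 / 262142) + 1

print(millicores_to_cpu_weight(500))   # 500m request
print(millicores_to_cpu_weight(1000))  # 1-CPU request -> weight 39
```

Note how compressed the low end is: a 1-CPU request lands at weight 39, not at the default of 100.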
cpu.stat — how much CPU have we used?
cat /sys/fs/cgroup/myservice/cpu.stat
# usage_usec 18293847
# user_usec 15201234
# system_usec 3092613
# nr_periods 1823
# nr_throttled 42 <- this cgroup was throttled 42 times
# throttled_usec 1234567 <- total time throttled
nr_throttled climbing on a CPU-bound workload means the workload is hitting its cpu.max quota and being paused by the kernel. This is CFS-quota throttling — extremely common for latency-sensitive services that burst briefly over their limit. The fix is either to raise the limit, remove it, or rewrite the workload to not burst. Kubernetes CPU limits are the #1 source of "why is my p99 so bad?" in most clusters because they throttle at sub-second timescales.
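The ratio worth alerting on is throttled periods over total periods. A hypothetical helper (names invented for illustration) that parses a `cpu.stat` blob:

```python
def throttle_stats(cpu_stat_text):
    """Summarize CFS-quota throttling from a v2 cpu.stat blob."""
    stats = dict(line.split() for line in cpu_stat_text.strip().splitlines())
    nr_periods = int(stats.get("nr_periods", 0))
    nr_throttled = int(stats.get("nr_throttled", 0))
    pct = 100 * nr_throttled / nr_periods if nr_periods else 0.0
    return nr_throttled, round(pct, 1)

sample = """usage_usec 18293847
user_usec 15201234
system_usec 3092613
nr_periods 1823
nr_throttled 42
throttled_usec 1234567"""
print(throttle_stats(sample))  # (42, 2.3) -> throttled in 2.3% of periods
```

In production you would read the blob from `/sys/fs/cgroup/<path>/cpu.stat` and track the counters over time, since they are cumulative.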
Memory Controller
Memory is the one every engineer meets first — and gets wrong first.
memory.max — hard limit
The cgroup cannot allocate more physical memory than this value. Trying to allocate past it triggers the OOM killer scoped to the cgroup — it picks a process in this cgroup and kills it.
# Limit this cgroup to 2 GiB
echo $((2 * 1024 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/myservice/memory.max
# Suffixes and "max" also work:
# echo "2G" | sudo tee .../memory.max
This is MemoryMax= in systemd and resources.limits.memory in Kubernetes.
memory.high — soft limit (throttling)
Above this threshold the cgroup is throttled: its allocations are forced into direct reclaim, deliberately slowing the cgroup down. It does not OOM at this level — it just gets slower. This sets up backpressure before you hit the hard wall.
# Throttle above 1.5 GiB, OOM above 2 GiB
echo $((1500 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/myservice/memory.high
echo $((2048 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/myservice/memory.max
memory.min and memory.low — protection
Guarantees the cgroup at least this much memory: when the system is under pressure, pages in this cgroup are protected from reclaim — protection up to memory.min is hard, and protection up to memory.low is best-effort.
This is Kubernetes resources.requests.memory territory (sort of — Kubernetes uses it to decide scheduling, not cgroup protection, but the mechanism exists).
memory.current — live usage
# Real-time memory usage
cat /sys/fs/cgroup/myservice/memory.current
# 834561024
# Detailed breakdown
cat /sys/fs/cgroup/myservice/memory.stat | head -15
# anon 524288000 <- anonymous memory (heap, stack)
# file 209715200 <- file-backed (page cache)
# kernel_stack 1048576
# slab 2097152
# ...
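That anon/file split is the first thing to pull out when usage looks wrong. A small illustrative parser (not a standard tool):

```python
def memory_breakdown(memory_stat_text):
    """Pull the anon vs file split out of a v2 memory.stat blob, in bytes."""
    stats = {}
    for line in memory_stat_text.strip().splitlines():
        key, value = line.split()
        stats[key] = int(value)
    return stats["anon"], stats["file"]

sample = "anon 524288000\nfile 209715200\nkernel_stack 1048576\nslab 2097152"
anon, file_backed = memory_breakdown(sample)
print(anon // 2**20, "MiB anon,", file_backed // 2**20, "MiB file-backed")
# 500 MiB anon, 200 MiB file-backed
```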
memory.events — OOM history
cat /sys/fs/cgroup/myservice/memory.events
# low 0
# high 128 <- was throttled by memory.high this many times
# max 3 <- was killed by memory.max this many times
# oom 3
# oom_kill 5 <- total processes killed in this cgroup
A team could not understand why their Kubernetes pod kept OOMing despite being well under its 4 GiB limit according to kubectl top pod. Looking at memory.events in the cgroup showed oom_kill 17 — the cgroup had been OOM-killing processes. memory.stat showed anonymous memory was fine, but file was rising: the pod was mmapping gigabytes of files, all counted against the cgroup. kubectl top reports working-set memory, which excludes inactive page cache — but the kernel was counting everything, and the cgroup limit applied to everything. The fix was to raise the limit; the lesson was: do not trust kubectl top's number as "what matters for OOM." Read memory.current and memory.stat when in doubt.
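The check from that incident is scriptable; a hypothetical helper that reads the counter the same way:

```python
def oom_kills(memory_events_text):
    """Return the oom_kill counter from a memory.events blob.

    Nonzero means the kernel has killed processes in this cgroup,
    regardless of what monitoring dashboards claim."""
    for line in memory_events_text.strip().splitlines():
        key, value = line.split()
        if key == "oom_kill":
            return int(value)
    return 0

sample = "low 0\nhigh 128\nmax 3\noom 3\noom_kill 5"
print(oom_kills(sample))  # 5
```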
I/O Controller
Limit bandwidth and IOPS per block device.
# Find the major:minor for a device
lsblk
# NAME MAJ:MIN
# nvme0n1 259:0
# Limit reads to 50 MB/s and writes to 30 MB/s on nvme0n1
echo "259:0 rbps=50000000 wbps=30000000" | sudo tee /sys/fs/cgroup/myservice/io.max
# Limit IOPS too
echo "259:0 riops=1000 wiops=500" | sudo tee /sys/fs/cgroup/myservice/io.max
# See what's actually happening
cat /sys/fs/cgroup/myservice/io.stat
# 259:0 rbytes=1234567 wbytes=987654 rios=120 wios=45 dbytes=0 dios=0
This is IOReadBandwidthMax=, IOWriteBandwidthMax=, etc. in systemd. Kubernetes has no built-in equivalent — it delegates I/O limits to runtime-specific settings (and most do not use them).
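Building the `io.max` line by hand is error-prone; a sketch of a formatter (an invented helper, assuming the documented behavior that unset fields can be written as `max`):

```python
def format_io_max(dev, rbps=None, wbps=None, riops=None, wiops=None):
    """Render an io.max line for one MAJ:MIN device; unset limits become 'max'."""
    fields = {"rbps": rbps, "wbps": wbps, "riops": riops, "wiops": wiops}
    parts = [f"{k}={v if v is not None else 'max'}" for k, v in fields.items()]
    return f"{dev} " + " ".join(parts)

print(format_io_max("259:0", rbps=50_000_000, wbps=30_000_000))
# 259:0 rbps=50000000 wbps=30000000 riops=max wiops=max
```

The output string is what you would `tee` into `io.max`.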
PID Controller
Prevent fork bombs from taking down the host.
# Cap at 1024 processes/threads
echo 1024 | sudo tee /sys/fs/cgroup/myservice/pids.max
# Current count
cat /sys/fs/cgroup/myservice/pids.current
# 17
# This is `TasksMax=` in systemd. Kubernetes has no standard
# `resources.limits.pids`, but container runtimes usually set pids.max automatically
Putting a Process in a Cgroup by Hand
Creating your own cgroup and moving a process into it takes three steps on v2:
# 1. Create the cgroup (just make a directory)
sudo mkdir /sys/fs/cgroup/experiment
# 2. Enable the controllers you want in the parent
echo "+cpu +memory +io +pids" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
# 3. Set limits
echo "200000 100000" | sudo tee /sys/fs/cgroup/experiment/cpu.max # 2 CPUs
echo $((1 * 1024 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/experiment/memory.max
echo 256 | sudo tee /sys/fs/cgroup/experiment/pids.max
# 4. Move a shell in — every child you spawn inherits membership
echo $$ | sudo tee /sys/fs/cgroup/experiment/cgroup.procs
# 5. Try it
yes > /dev/null &
yes > /dev/null & # runs flat out; but your shell is in "experiment", so combined 2 CPUs max
# Watch it
watch -n1 "cat /sys/fs/cgroup/experiment/cpu.stat"
# Memory stress
python3 -c 'a = "x"*2_000_000_000' # Killed — a 2 GB allocation exceeds our 1 GiB memory.max
# Clean up
# Move yourself back to the root cgroup first
echo $$ | sudo tee /sys/fs/cgroup/cgroup.procs
sudo rmdir /sys/fs/cgroup/experiment
This is essentially what every container runtime does, minus a lot of bookkeeping and the namespace creation step.
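The manual steps above can be wrapped in a script. A sketch (function names invented; the real thing needs root and a v2 mount, so the writer is injectable for a dry run):

```python
import os

def setup_cgroup(name, limits, root="/sys/fs/cgroup", write=None, mkdir=None):
    """Enable controllers in the parent, create the child, write its limits."""
    write = write or (lambda path, value: open(path, "w").write(value))
    mkdir = mkdir or (lambda path: os.makedirs(path, exist_ok=True))
    # Derive "+cpu +memory ..." from the limit filenames (cpu.max -> +cpu)
    controllers = " ".join("+" + f.split(".")[0] for f in limits)
    write(os.path.join(root, "cgroup.subtree_control"), controllers)
    cgroup = os.path.join(root, name)
    mkdir(cgroup)
    for fname, value in limits.items():
        write(os.path.join(cgroup, fname), value)
    return cgroup

# Dry run: record writes instead of touching the real filesystem
ops = []
setup_cgroup("experiment", {"cpu.max": "200000 100000", "memory.max": str(2**30)},
             write=lambda p, v: ops.append((p, v)), mkdir=lambda p: None)
for path, value in ops:
    print(path, "<-", value)
```

Dropping the `write`/`mkdir` overrides and running as root performs the real steps from the shell session above (minus moving a PID into `cgroup.procs`).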
Inspection Tools
systemd tools
# Live top-like view of cgroup resource usage
systemd-cgtop
# Control Group Tasks %CPU Memory Input/s Output/s
# / 527 23.5 7.2G - -
# system.slice 180 12.3 3.4G - -
# system.slice/nginx.service 9 0.1 8.4M
# system.slice/docker.service 120 8.7 1.2G
# user.slice/user-1000.slice 120 2.1 1.8G
# ...
# Hierarchical view
systemd-cgls
# Per-service live stats
systemctl status nginx # shows Memory: ..., CPU: ..., CGroup:...
lscgroup and cgget (cgroup-tools package)
lscgroup # list all cgroups
cgget -n -r memory.current /system.slice/nginx.service
Direct /sys/fs/cgroup reads
Everything is files — cat anything.
v1 Quick Reference (If You Are Stuck With It)
You will see v1 mostly on very old systems, RHEL 7, or Kubernetes clusters running older CRI runtimes. Under v1:
- Each controller has its own hierarchy: `/sys/fs/cgroup/cpu/`, `/sys/fs/cgroup/memory/`, `/sys/fs/cgroup/blkio/`, etc.
- File names differ: `memory.limit_in_bytes` instead of `memory.max`, `cpu.cfs_quota_us` + `cpu.cfs_period_us` instead of `cpu.max`.
- You put a process in each controller's hierarchy independently.
# v1 example: memory limit
echo 2G > /sys/fs/cgroup/memory/mygroup/memory.limit_in_bytes
# v1 example: cpu quota (200ms of CPU every 100ms period = 2 CPUs)
echo 100000 > /sys/fs/cgroup/cpu/mygroup/cpu.cfs_period_us
echo 200000 > /sys/fs/cgroup/cpu/mygroup/cpu.cfs_quota_us
If you maintain services that work on both, scripts typically probe the filesystem and branch. Most tooling has migrated to v2 paths by now.
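That probe-and-branch pattern fits in a few lines. An illustrative sketch (invented helper names; the file paths are the documented v1/v2 ones):

```python
import os

def detect_version(root="/sys/fs/cgroup"):
    """v2 iff the unified hierarchy's cgroup.controllers file is present."""
    return "v2" if os.path.exists(os.path.join(root, "cgroup.controllers")) else "v1"

def limit_files(version):
    """Map a cgroup version to the (cpu, memory) limit files a script must write."""
    if version == "v2":
        return {"memory": "memory.max", "cpu": "cpu.max"}
    # v1: per-controller hierarchies, so paths include the controller directory
    return {"memory": "memory/memory.limit_in_bytes",
            "cpu": "cpu/cpu.cfs_quota_us"}

print(limit_files("v1")["memory"])  # memory/memory.limit_in_bytes
print(limit_files("v2")["cpu"])     # cpu.max
```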
Key Concepts Summary
- cgroups limit and account for resources. Namespaces control views; cgroups control usage.
- cgroups v2 is the single unified hierarchy. Every modern distro uses it. v1 exists on legacy systems and has per-controller hierarchies.
- Every process is in exactly one cgroup on v2. `cat /proc/$PID/cgroup` shows the path.
- Controllers you will touch: cpu, memory, io, pids. Each exposes a handful of files: `cpu.max`, `cpu.weight`, `memory.max`, `memory.high`, `io.max`, `pids.max`.
- Creating a cgroup = making a directory. Adding a process = writing its PID to `cgroup.procs`. Setting limits = writing to control files.
- `memory.max` triggers OOM; `cpu.max` causes throttling. Throttling is usually the invisible killer of p99.
- `memory.events` tells you if OOM has ever happened in this cgroup. The `oom_kill` count is truth.
- `systemd-cgtop` is the best interactive overview. `systemctl status UNIT` shows per-service cgroup stats.
- Container runtimes and systemd both create cgroups automatically. Understanding the plumbing helps you debug when limits do not behave as expected.
Common Mistakes
- Setting an aggressive `cpu.max` on a latency-sensitive service and then wondering why p99 doubled. CFS throttling is invisible to application timing but obvious in `nr_throttled`.
- Confusing `memory.current` (includes page cache) with the "working set" number monitoring tools usually report. Page cache is reclaimable, but a misconfigured workload can be OOM-killed anyway because `memory.max` applies to the sum.
- Assuming `memory.max` also caps swap. It does not; swap is accounted separately, so set `memory.swap.max` explicitly if you want to limit or disable swap for the cgroup.
- Running a Java app with `-Xmx8G` on a cgroup limited to 4G. The JVM does not look at the cgroup unless told to. Use `-XX:+UseContainerSupport` (default in JDK 11+) or set `-Xmx` appropriately.
- Putting a process in a cgroup and then being surprised that children inherit membership. That is the default — which is usually what you want, but be aware.
- Treating v1 and v2 paths as interchangeable in scripts. `memory.limit_in_bytes` (v1) is `memory.max` (v2); `cpu.cfs_quota_us` (v1) is the first field of `cpu.max` (v2). Probe the filesystem first.
- Using `cpu.shares` (v1) or `cpu.weight` (v2) to "guarantee" CPU — they only matter under contention. When CPUs are idle, weights do nothing.
- Forgetting to enable controllers in `cgroup.subtree_control` before creating child cgroups. The child's limits quietly have no effect because the controller's files are not even present.
- Reading Kubernetes "CPU throttling" metrics and concluding the pod needs more CPU, when the real fix is to remove the CPU limit entirely and let it burst. Kubernetes CPU requests (cgroup weights) are enough in most fairly-scheduled clusters; CPU limits (cgroup quota) are almost always harmful.
A Kubernetes pod has requests: cpu=500m, limits: cpu=1. Under moderate load it regularly shows 15% throttled CPU time in Prometheus, even though node CPU utilization is only 40%. Application p99 latency has increased 3x since adding the CPU limit. What is happening, and what is the typical production fix?