Linux Fundamentals for Engineers

cgroups v1 vs v2

A Kubernetes pod in production has its memory limit (resources.limits.memory) set to 4 GiB. The app inside sees 64 GiB via /proc/meminfo (the host's memory), cheerfully allocates 6 GiB of Python objects, and gets OOM-killed with exit code 137. The pod restarts, runs fine for a minute, then OOMs again. An engineer spends an afternoon convinced that the memory limit is being enforced incorrectly, that there is a memory leak, that the runtime is misbehaving — before noticing that the app's --max-memory flag is computed from /proc/meminfo and has no idea about the container's cgroup limit.

If namespaces tell a process what it can see, cgroups tell the kernel what it can use. Every resource limit you set on a container, every MemoryMax= in a systemd unit, every cpu.shares value in a Kubernetes QoS class — all of them are cgroup controls. This lesson covers what cgroups are, the difference between v1 and v2 (every Linux system from ~2022 onward runs v2), and the controllers you will actually tune: CPU, memory, and I/O.


What cgroups Do

A cgroup (control group) is a kernel-managed group of processes to which resource controls apply. The kernel tracks each cgroup's usage, enforces limits, and provides accounting.

The unit of management is a directory in a special filesystem. Create a subdirectory, echo a process PID into cgroup.procs, and that process is now a member of that cgroup. Write a value into memory.max — that is now the memory limit for every process in the cgroup. The kernel does the rest.

cgroups control:

  • CPU — how much CPU time processes in the cgroup can use, and at what priority.
  • Memory — hard and soft memory limits, swap limits, OOM behavior.
  • I/O — disk bandwidth and IOPS limits per block device.
  • PIDs — how many processes/threads the cgroup can create.
  • Devices — which device nodes the cgroup can read/write.
  • Network classifiers — tagging traffic for QoS (used with tc).
  • RDMA, hugetlb, misc — specialized controllers.

Plus accounting: even without limits, cgroups count bytes read, CPU time used, etc.

KEY CONCEPT

A cgroup is the answer to "how much of each resource is this set of processes using, and how much are they allowed?" Namespaces isolate what processes see; cgroups limit what they can do. Containers, systemd services, and Kubernetes pods all live in cgroups — the kernel uses the cgroup hierarchy to enforce limits and to report usage back up to your monitoring.


v1 vs v2: Why There Are Two

cgroups v1 (Linux 2.6.24, 2008): each controller has its own hierarchy. A process could be in one memory cgroup and a completely different CPU cgroup, independently. This was flexible but unwieldy: tools had to coordinate across multiple hierarchies, and the controllers could disagree about which group a process effectively belonged to.

cgroups v2 (Linux 4.5, 2016; stable ~2018; default on almost every modern distro): a single unified hierarchy. Every process is in exactly one cgroup at a time, and that cgroup has all enabled controllers. Simpler, more consistent, better integrated with systemd.

How to tell which you are on:

# The filesystem under /sys/fs/cgroup tells you
mount | grep cgroup
# v2: cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
# v1: cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,... cpu,cpuacct)
#     cgroup on /sys/fs/cgroup/memory type cgroup (rw,... memory)
#     ...many mounts
# Hybrid: some v1 + /sys/fs/cgroup/unified on cgroup2

# Or look at the structure
ls /sys/fs/cgroup | head
# v2 single file tree, files like cgroup.controllers, cpu.max, memory.max, etc.
# v1 multiple directories like cpu/, memory/, blkio/, each with their own files

# The authoritative check
stat -fc %T /sys/fs/cgroup/
# cgroup2fs   (pure v2)
# tmpfs       (v1 or hybrid — look at contents)
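
The checks above are easy to wrap in a helper for scripts. A minimal sketch — the function name is ours, not a standard tool; the fstype strings are the ones stat reports:

```shell
# Map the filesystem type reported by `stat -fc %T /sys/fs/cgroup/`
# to a cgroup version label.
cgroup_version_from_fstype() {
  case "$1" in
    cgroup2fs) echo "v2" ;;
    tmpfs)     echo "v1-or-hybrid" ;;  # inspect /sys/fs/cgroup contents to tell which
    *)         echo "unknown" ;;
  esac
}

# On a live system:
cgroup_version_from_fstype "$(stat -fc %T /sys/fs/cgroup/ 2>/dev/null)"
```

Separating the probe from the decision keeps the mapping testable without root or a particular host configuration.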

This lesson uses v2 as the default — it is what you will encounter on Ubuntu 22.04+, Debian 11+, Fedora, RHEL 9+, Arch, and every current cloud Linux distribution. Where v1 differs significantly, we will call it out.


The v2 Hierarchy

Under /sys/fs/cgroup/, each directory is a cgroup. Directories can contain other directories (subcgroups). Every process is pinned to exactly one cgroup at any moment.

# Top-level layout on a systemd system
ls /sys/fs/cgroup
# cgroup.controllers         <- which controllers are available at root
# cgroup.procs               <- PIDs in the root cgroup
# cgroup.subtree_control     <- which controllers are enabled in children
# cpu.max, memory.max, io.max, ...   <- root-level limits (usually none)
# init.scope/                <- PID 1 (systemd itself)
# system.slice/              <- system services live here
# user.slice/                <- user sessions live here

# Explore a specific service's cgroup
ls /sys/fs/cgroup/system.slice/nginx.service/
# cgroup.controllers
# cgroup.events
# cgroup.procs               <- every PID in this cgroup
# cgroup.stat
# cpu.max                    <- CPU limit (if any)
# cpu.pressure
# cpu.stat
# cpu.weight
# io.stat
# io.max
# memory.max
# memory.min
# memory.current
# memory.events
# memory.pressure
# pids.max
# pids.current
# ...

# Every PID in this cgroup
cat /sys/fs/cgroup/system.slice/nginx.service/cgroup.procs
# 12302
# 12303
# 12304
# ...

The hierarchy mirrors systemd's structure: slices (groups of units — system.slice, user.slice), scopes (groups of externally created processes, such as user sessions), and services (one cgroup per .service unit).

Which cgroup is a process in?

cat /proc/$PID/cgroup
# 0::/system.slice/nginx.service
# or inside a container:
# 0::/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod....scope/cri-containerd-...scope

On v2 this is always a single line 0::/PATH because there is one unified hierarchy.
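
Extracting the path from that line is a one-liner. A sketch — the function name is ours; on v2 the format is `0::<path>`, so we take everything after the second colon:

```shell
# Extract the cgroup path from a v2 /proc/<pid>/cgroup line.
v2_cgroup_path() {
  printf '%s\n' "$1" | cut -d: -f3-
}

v2_cgroup_path "0::/system.slice/nginx.service"
# -> /system.slice/nginx.service

# Live usage: v2_cgroup_path "$(cat /proc/self/cgroup)"
```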


CPU Controller

The CPU controller limits and prioritizes CPU time.

cpu.max — hard quota and period

Format: <quota> <period>, both in microseconds: within each window of period microseconds, processes in the cgroup can use up to quota microseconds of CPU time combined.

# Limit the cgroup to 1 full CPU (100ms quota every 100ms)
echo "100000 100000" | sudo tee /sys/fs/cgroup/myservice/cpu.max

# 0.5 CPUs
echo "50000 100000" | sudo tee /sys/fs/cgroup/myservice/cpu.max

# 2 CPUs
echo "200000 100000" | sudo tee /sys/fs/cgroup/myservice/cpu.max

# Remove the limit
echo "max 100000" | sudo tee /sys/fs/cgroup/myservice/cpu.max

This is CPUQuota= in systemd and resources.limits.cpu in Kubernetes.
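
The quota arithmetic is mechanical enough to script. A sketch (the helper name is ours) that turns a CPU count into a cpu.max line, using the conventional 100ms period from the examples above:

```shell
# Convert a CPU count (fractions allowed) into "<quota> <period>"
# for cpu.max, with the usual 100000us (100ms) period.
cpus_to_cpu_max() {
  awk -v cpus="$1" 'BEGIN { printf "%d 100000\n", cpus * 100000 }'
}

cpus_to_cpu_max 0.5    # -> 50000 100000
cpus_to_cpu_max 2      # -> 200000 100000

# Usage: cpus_to_cpu_max 1.5 | sudo tee /sys/fs/cgroup/myservice/cpu.max
```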

cpu.weight — relative priority

Range 1–10000, default 100. Under contention, sibling cgroups are scheduled in proportion to their weights: a cgroup with weight 200 gets twice the CPU of one with weight 100 — but only when both are demanding CPU at the same time. With no contention, every cgroup gets whatever it asks for, regardless of weight.

# Make this service high priority
echo 500 | sudo tee /sys/fs/cgroup/high-priority-job/cpu.weight

# Low priority background work
echo 10 | sudo tee /sys/fs/cgroup/cleanup.service/cpu.weight

This is CPUWeight= in systemd and resources.requests.cpu in Kubernetes (Kubernetes converts CPU requests into cgroup weights).
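
The proportionality is easy to sanity-check with arithmetic. Under full contention, each sibling's share is its weight divided by the sum of the sibling weights — here with the weights 500 and 10 from the examples above plus a default-100 sibling (our hypothetical third cgroup):

```shell
# Expected CPU shares under full contention for sibling weights 500, 10, 100.
awk 'BEGIN {
  w[1] = 500; w[2] = 10; w[3] = 100
  for (i in w) sum += w[i]
  for (i = 1; i <= 3; i++)
    printf "weight %d -> %.1f%%\n", w[i], 100 * w[i] / sum
}'
# weight 500 -> 82.0%
# weight 10 -> 1.6%
# weight 100 -> 16.4%
```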

cpu.stat — how much CPU have we used?

cat /sys/fs/cgroup/myservice/cpu.stat
# usage_usec 18293847
# user_usec 15201234
# system_usec 3092613
# nr_periods 1823
# nr_throttled 42                <- this cgroup was throttled 42 times
# throttled_usec 1234567         <- total time spent throttled

WARNING

nr_throttled climbing on a CPU-bound workload means the workload is hitting its cpu.max quota and being paused by the kernel. This is CFS-quota throttling — extremely common for latency-sensitive services that burst briefly over their limit. The fix is either to raise the limit, remove it, or rewrite the workload to not burst. Kubernetes CPU limits are the #1 source of "why is my p99 so bad?" in most clusters because they throttle at sub-second timescales.
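
One derived number worth watching (our suggestion, not a standard metric) is the fraction of periods that were throttled, computed from the cpu.stat fields shown above:

```shell
# Fraction of scheduler periods in which this cgroup hit its quota.
# Reads cpu.stat-formatted input on stdin.
throttle_ratio() {
  awk '/^nr_periods/   { p = $2 }
       /^nr_throttled/ { t = $2 }
       END { if (p > 0) printf "%.1f%%\n", 100 * t / p }'
}

printf 'nr_periods 1823\nnr_throttled 42\n' | throttle_ratio
# -> 2.3%

# Live: throttle_ratio < /sys/fs/cgroup/myservice/cpu.stat
```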


Memory Controller

Memory is the one every engineer meets first — and gets wrong first.

memory.max — hard limit

The cgroup cannot use more physical memory than this value. An allocation that would exceed it first forces reclaim within the cgroup; if the kernel cannot reclaim enough, the OOM killer fires scoped to the cgroup — it picks a process inside it and kills it.

# Limit this cgroup to 2 GiB
echo $((2 * 1024 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/myservice/memory.max
# Size suffixes are also accepted:
# echo "2G" | sudo tee .../memory.max

This is MemoryMax= in systemd and resources.limits.memory in Kubernetes.

memory.high — soft limit (throttling)

Above this threshold the cgroup is throttled: its allocations are forced into direct reclaim, deliberately slowing it down. Nothing is OOM-killed at this level — the cgroup just gets slower. This provides backpressure before you hit the hard wall.

# Throttle above 1.5 GiB, OOM above 2 GiB
echo $((1500 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/myservice/memory.high
echo $((2048 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/myservice/memory.max
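
A common convention (an assumption on our part, not a kernel rule) is to derive memory.high as a fixed fraction of memory.max, so throttling always kicks in before the OOM killer:

```shell
# Derive a memory.high value as 90% of a given memory.max (in bytes).
high_from_max() {
  awk -v max="$1" 'BEGIN { printf "%d\n", max * 0.9 }'
}

high_from_max $((2 * 1024 * 1024 * 1024))
# -> 1932735283

# Usage: high_from_max "$(cat .../memory.max)" | sudo tee .../memory.high
```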

memory.min and memory.low — protection

Protects the cgroup's memory when the system is under pressure: pages in this cgroup are shielded from reclaim unconditionally up to memory.min, and on a best-effort basis up to memory.low.

This is Kubernetes resources.requests.memory territory (sort of — Kubernetes uses it to decide scheduling, not cgroup protection, but the mechanism exists).

memory.current — live usage

# Real-time memory usage
cat /sys/fs/cgroup/myservice/memory.current
# 834561024

# Detailed breakdown
cat /sys/fs/cgroup/myservice/memory.stat | head -15
# anon 524288000               <- anonymous memory (heap, stack)
# file 209715200               <- file-backed (page cache)
# kernel_stack 1048576
# slab 2097152
# ...
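
A quick derived view (the helper name is ours): what share of the cgroup's charged memory is page cache, parsed from the anon/file fields shown above?

```shell
# Percentage of charged memory that is file-backed (page cache),
# reading memory.stat-formatted input on stdin.
file_pct() {
  awk '/^anon / { a = $2 }
       /^file / { f = $2 }
       END { printf "%.0f%%\n", 100 * f / (a + f) }'
}

printf 'anon 524288000\nfile 209715200\n' | file_pct
# -> 29%

# Live: file_pct < /sys/fs/cgroup/myservice/memory.stat
```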

memory.events — OOM history

cat /sys/fs/cgroup/myservice/memory.events
# low 0
# high 128                     <- exceeded memory.high this many times
# max 3                        <- allocations hit memory.max this many times
# oom 3                        <- OOM killer invoked this many times
# oom_kill 5                   <- total processes killed in this cgroup

WAR STORY

A team could not understand why their Kubernetes pod kept OOMing despite kubectl top pod showing it well under its 4 GiB limit. memory.events in the cgroup showed oom_kill 17 — the kernel had been killing processes. memory.stat showed anonymous memory was stable, but file was rising: the pod was mmapping gigabytes of files, all charged against the cgroup. kubectl top reports working-set memory, which excludes inactive page cache — but the kernel charges everything, and memory.max applies to everything. The fix was to raise the limit; the lesson was: do not treat kubectl top's number as "what matters for OOM." Read memory.current and memory.stat when in doubt.
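
That first diagnostic step — "has this cgroup ever OOM-killed anything?" — can be scripted. A sketch (the function name is ours) that checks a memory.events stream:

```shell
# Succeed (exit 0) if the cgroup has ever OOM-killed a process,
# reading memory.events-formatted input on stdin.
had_oom_kill() {
  awk '/^oom_kill/ && $2 > 0 { found = 1 } END { exit(found ? 0 : 1) }'
}

printf 'oom 3\noom_kill 5\n' | had_oom_kill && echo "OOM kills happened here"
# -> OOM kills happened here

# Live: had_oom_kill < /sys/fs/cgroup/system.slice/nginx.service/memory.events
```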


I/O Controller

Limit bandwidth and IOPS per block device.

# Find the major:minor for a device
lsblk
# NAME         MAJ:MIN
# nvme0n1      259:0

# Limit reads to 50 MB/s and writes to 30 MB/s on nvme0n1
echo "259:0 rbps=50000000 wbps=30000000" | sudo tee /sys/fs/cgroup/myservice/io.max

# Limit IOPS too
echo "259:0 riops=1000 wiops=500" | sudo tee /sys/fs/cgroup/myservice/io.max

# See what's actually happening
cat /sys/fs/cgroup/myservice/io.stat
# 259:0 rbytes=1234567 wbytes=987654 rios=120 wios=45 dbytes=0 dios=0

This is IOReadBandwidthMax=, IOWriteBandwidthMax=, etc. in systemd. Kubernetes has no built-in equivalent — it delegates I/O limits to runtime-specific settings (and most do not use them).
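
The io.max syntax is easy to generate. A sketch (the helper name is ours) that builds a line from a device and per-direction MB/s figures:

```shell
# Build an io.max line from MAJ:MIN plus read/write limits in MB/s.
io_max_line() {
  echo "$1 rbps=$(( $2 * 1000000 )) wbps=$(( $3 * 1000000 ))"
}

io_max_line 259:0 50 30
# -> 259:0 rbps=50000000 wbps=30000000

# Usage: io_max_line 259:0 50 30 | sudo tee /sys/fs/cgroup/myservice/io.max
```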


PID Controller

Prevent fork bombs from taking down the host.

# Cap at 1024 processes/threads
echo 1024 | sudo tee /sys/fs/cgroup/myservice/pids.max

# Current count
cat /sys/fs/cgroup/myservice/pids.current
# 17

This is TasksMax= in systemd. Kubernetes has no resources.limits.pids field in the pod spec, but container runtimes (and the kubelet's pod PID limit) usually set pids.max automatically.


Putting a Process in a Cgroup by Hand

Creating your own cgroup and moving a process into it takes five steps on v2:

# 1. Create the cgroup (just make a directory)
sudo mkdir /sys/fs/cgroup/experiment

# 2. Enable the controllers you want in the parent
echo "+cpu +memory +io +pids" | sudo tee /sys/fs/cgroup/cgroup.subtree_control

# 3. Set limits
echo 200000 100000 | sudo tee /sys/fs/cgroup/experiment/cpu.max       # 2 CPUs
echo $((1 * 1024 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/experiment/memory.max
echo 256 | sudo tee /sys/fs/cgroup/experiment/pids.max

# 4. Move a shell in — every child you spawn inherits membership
echo $$ | sudo tee /sys/fs/cgroup/experiment/cgroup.procs

# 5. Try it
yes > /dev/null &
yes > /dev/null &        # runs flat out; but your shell is in "experiment", so combined 2 CPUs max

# Watch it
watch -n1 "cat /sys/fs/cgroup/experiment/cpu.stat"

# Memory stress
python3 -c 'a = "x"*2_000_000_000'    # Killed — 2 GB allocation exceeds our 1 GB memory.max

# Clean up
# Move yourself back to the root cgroup first
echo $$ | sudo tee /sys/fs/cgroup/cgroup.procs
sudo rmdir /sys/fs/cgroup/experiment

This is essentially what every container runtime does, minus a lot of bookkeeping and the namespace creation step.


Inspection Tools

systemd tools

# Live top-like view of cgroup resource usage
systemd-cgtop
# Control Group                           Tasks   %CPU   Memory   Input/s   Output/s
# /                                         527   23.5    7.2G        -        -
# system.slice                              180   12.3    3.4G        -        -
# system.slice/nginx.service                  9    0.1    8.4M
# system.slice/docker.service               120    8.7    1.2G
# user.slice/user-1000.slice                120    2.1    1.8G
# ...

# Hierarchical view
systemd-cgls

# Per-service live stats
systemctl status nginx      # shows Memory: ..., CPU: ..., CGroup:...

lscgroup and cgget (cgroup-tools package)

lscgroup                      # list all cgroups
cgget -n -r memory.current /system.slice/nginx.service

Direct /sys/fs/cgroup reads

Everything is files — cat anything.


v1 Quick Reference (If You Are Stuck With It)

You will see v1 mostly on very old systems, RHEL 7, or Kubernetes clusters running older CRI runtimes. Under v1:

  • Each controller has its own hierarchy: /sys/fs/cgroup/cpu/, /sys/fs/cgroup/memory/, /sys/fs/cgroup/blkio/, etc.
  • File names differ: memory.limit_in_bytes instead of memory.max, cpu.cfs_quota_us + cpu.cfs_period_us instead of cpu.max.
  • You put a process in each controller's hierarchy independently.

# v1 example: memory limit
echo 2G > /sys/fs/cgroup/memory/mygroup/memory.limit_in_bytes

# v1 example: cpu quota
echo 100000 > /sys/fs/cgroup/cpu/mygroup/cpu.cfs_period_us
echo 200000 > /sys/fs/cgroup/cpu/mygroup/cpu.cfs_quota_us

If you maintain services that work on both, scripts typically probe the filesystem and branch. Most tooling has migrated to v2 paths by now.
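
A sketch of that probe-and-branch pattern (the function name and the "mygroup" group are placeholders): return the right memory-limit control file for whichever hierarchy is mounted, given the fstype that stat reports:

```shell
# Return the memory-limit control file for a cgroup name, given the
# filesystem type reported by `stat -fc %T /sys/fs/cgroup/`.
mem_limit_path() {
  if [ "$2" = "cgroup2fs" ]; then
    echo "/sys/fs/cgroup/$1/memory.max"                     # v2
  else
    echo "/sys/fs/cgroup/memory/$1/memory.limit_in_bytes"   # v1
  fi
}

mem_limit_path mygroup cgroup2fs
# -> /sys/fs/cgroup/mygroup/memory.max

# Live: echo 2G | sudo tee "$(mem_limit_path mygroup "$(stat -fc %T /sys/fs/cgroup/)")"
```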


Key Concepts Summary

  • cgroups limit and account for resources. Namespaces control views; cgroups control usage.
  • cgroups v2 is the single unified hierarchy. Every modern distro uses it. v1 exists on legacy systems and has per-controller hierarchies.
  • Every process is in exactly one cgroup on v2. cat /proc/$PID/cgroup shows the path.
  • Controllers you will touch: cpu, memory, io, pids. Each exposes a handful of files: cpu.max, cpu.weight, memory.max, memory.high, io.max, pids.max.
  • Creating a cgroup = making a directory. Adding a process = writing its PID to cgroup.procs. Setting limits = writing to control files.
  • memory.max triggers OOM; cpu.max causes throttling. Throttling is usually the invisible killer of p99.
  • memory.events tells you if OOM has ever happened in this cgroup. oom_kill count is truth.
  • systemd-cgtop is the best interactive overview. systemctl status UNIT shows per-service cgroup stats.
  • Container runtimes and systemd both create cgroups automatically. Understanding the plumbing helps you debug when limits do not behave as expected.

Common Mistakes

  • Setting aggressive cpu.max on a latency-sensitive service and then wondering why p99 doubled. CFS throttling is invisible to application timing but obvious in nr_throttled.
  • Confusing memory.current (includes page cache) with the "working set" number monitoring tools usually report. Page cache is reclaimable, but a misconfigured workload can be OOM-killed anyway because memory.max is the sum.
  • Assuming memory.max also caps swap. It does not — swap is accounted separately under memory.swap.max, which defaults to unlimited, so a cgroup at its memory limit can still push pages to swap; set memory.swap.max explicitly (often to 0) if you want to forbid that.
  • Running a Java app with -Xmx8G in a cgroup limited to 4G. The JVM does not look at the cgroup unless told to. Rely on -XX:+UseContainerSupport (on by default since JDK 10, and backported to 8u191) or set -Xmx appropriately.
  • Putting a process in a cgroup and then being surprised that children inherit membership. That is the default — which is usually what you want, but be aware.
  • Treating v1 and v2 paths as interchangeable in scripts. memory.limit_in_bytes (v1) is memory.max (v2); cpu.cfs_quota_us (v1) is half of cpu.max (v2). Probe the filesystem first.
  • Using cpu.shares (v1) or cpu.weight (v2) to "guarantee" CPU — they only matter under contention. When CPUs are idle, weights do nothing.
  • Forgetting to enable controllers in cgroup.subtree_control before creating child cgroups. The child's limits quietly have no effect because the controller is not even present.
  • Reading Kubernetes "CPU throttling" metrics and concluding the pod needs more CPU, when the real fix is to remove the CPU limit entirely and let it burst. Kubernetes CPU requests (cgroup weights) are enough in most scheduled-fairly clusters; CPU limits (cgroup quota) are almost always harmful.

KNOWLEDGE CHECK

A Kubernetes pod has requests: cpu=500m, limits: cpu=1. Under moderate load it regularly shows 15% throttled CPU time in Prometheus, even though node CPU utilization is only 40%. Application p99 latency has increased 3x since adding the CPU limit. What is happening, and what is the typical production fix?