The Three Primitives — Namespaces, cgroups, chroot
A junior on the team asks: "What does Docker do? Like, if I ran
`docker run alpine sh`, what is actually happening on the host?" The senior pauses, opens a terminal, and types three commands — `unshare`, `pivot_root`, `echo PID > cgroup.procs`. Thirty seconds later there is a new shell that thinks it is PID 1, sees a different filesystem, and is capped at one CPU. "That," the senior says, "is what Docker does. Those three things. Everything else is plumbing."
Every container runtime — Docker, containerd, runc, podman, Kubernetes' CRI implementations — is built on three Linux primitives: namespaces (isolation), cgroups (resource limits), and a pivoted rootfs (filesystem view). Once you have seen how they compose, "what is a container" goes from mystery to recipe. This lesson walks through all three, how they interlock, and how Docker invokes them for you when you type `docker run`.
The Recipe, in One Page
A container is:
- A process (spawned via `clone()`)...
- ...placed in new namespaces (to isolate its view)...
- ...with its root filesystem pivoted (so its `/` is the image's rootfs, not the host's)...
- ...attached to a cgroup (to limit CPU, memory, I/O)...
- ...with capabilities dropped and optionally a seccomp filter (to limit what syscalls it can make).
That is it. Every container runtime does those five steps. The Linux kernel provides all of them directly — no virtualization, no emulation. This is why you could build a container by hand in Module 5 of the Linux course, and it is why Docker is "just" a user-friendly wrapper.
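That by-hand build can be sketched with stock util-linux tools. A rough sketch, not a production recipe: it assumes root privileges, a recent util-linux, cgroup v2 mounted at `/sys/fs/cgroup`, and a rootfs unpacked into `./rootfs` (a path chosen for this demo). It also skips the capability and seccomp steps, and `unshare --root` uses `chroot` rather than a true `pivot_root`, which is close enough to see the idea.

```shell
# Unpack a rootfs to play with (any image tarball works)
mkdir -p rootfs && docker export "$(docker create alpine)" | tar -C rootfs -x

# Step 4 of the recipe: make a cgroup and cap memory at 256 MiB
mkdir -p /sys/fs/cgroup/handmade
echo $((256 * 1024 * 1024)) > /sys/fs/cgroup/handmade/memory.max
echo $$ > /sys/fs/cgroup/handmade/cgroup.procs   # children inherit membership

# Steps 1-3: clone into fresh namespaces with a new root.
# --mount-proc remounts /proc so ps sees only this PID namespace.
unshare --mount --uts --ipc --net --pid --fork --mount-proc \
        --root=./rootfs /bin/sh
# Inside: hostname, process list, filesystem, and memory cap are all "containerized"
```

Everything here is a plain syscall wrapper: `unshare` is `clone()`/`unshare()` flags, the cgroup is two file writes, and the rootfs is a directory.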
Docker did not invent containers. LXC had most of this before Docker existed; FreeBSD jails had a subset in 2000; Solaris zones had it in 2004. What Docker invented was a friendly CLI (docker run), a standard image format, and a registry for distribution. The isolation primitives are Linux kernel features, not Docker features — which is why Docker can be replaced with containerd, runc, or podman without changing how containers behave.
Primitive 1: Namespaces — "The Kernel Lies to the Process"
Namespaces isolate what a process can see. The kernel provides eight kinds (covered in full in the Linux course, Module 5); Docker uses up to seven of them (the time namespace is unused, and the user namespace is opt-in):
| Namespace | What it isolates | What the container sees vs the host |
|---|---|---|
| `mnt` | Mount table | Own `/`, `/etc`, `/var` — the image's rootfs |
| `pid` | PID number space | Container's main process is PID 1 |
| `net` | Network interfaces, routes, sockets | Own `eth0` (or none), own routing table |
| `uts` | Hostname + NIS domain | Container has its own hostname |
| `ipc` | System V IPC + POSIX message queues | No shared memory with host processes |
| `user` | UID/GID mappings (optional) | Container's root can be unprivileged on host |
| `cgroup` | View into `/sys/fs/cgroup` | Cgroup paths rooted at the container's cgroup |
Every docker run creates a new set of these and puts the container's process in them.
See namespaces at work
# Start a container
docker run -d --name demo alpine sleep 1000
PID=$(docker inspect --format='{{.State.Pid}}' demo)
echo "Container PID on host: $PID"
# Inspect the namespaces the container is in
ls -l /proc/$PID/ns | head
# lrwxrwxrwx 1 root root 0 ... mnt -> 'mnt:[4026532421]'
# lrwxrwxrwx 1 root root 0 ... net -> 'net:[4026532422]'
# lrwxrwxrwx 1 root root 0 ... pid -> 'pid:[4026532423]'
# ...
# Each inode number is a distinct namespace.
# Compare with host PID 1's namespaces (systemd on most systems)
diff <(ls -l /proc/1/ns | awk '{print $9, $11}') \
<(ls -l /proc/$PID/ns | awk '{print $9, $11}')
# Differs on: mnt, net, pid, uts, ipc, (possibly user, cgroup)
# Same on: (varies by runtime)
# Hop into the container's namespaces with nsenter — no docker exec needed
sudo nsenter -t $PID -n ip addr
# eth0@if8 ... 172.17.0.2/16 <- the container's network view
sudo nsenter -t $PID -m ls /
# bin etc lib usr ... <- the container's filesystem view
docker rm -f demo
`nsenter -t <host-pid> -n <cmd>` is the single most useful command for debugging container networking from the host. You get host-side tools (`ip`, `ss`, `tcpdump`, `ping`) with the container's network view. No need to install anything inside the minimal container image — you already have everything on the host.
PID 1 inside the container
The container's main process sees itself as PID 1 because it is in a new PID namespace. From the host, the same process has a different PID (e.g., 18342). From inside the container, /proc shows only the processes in the container's PID namespace.
# From inside a running container (re-create demo if you removed it above)
docker run -d --name demo alpine sleep 1000
PID=$(docker inspect --format='{{.State.Pid}}' demo)
docker exec demo ps -ef
# PID USER COMMAND
# 1 root sleep 1000
# 7 root ps -ef
# Nothing from the host is visible.
# But from the host, the container's processes are right there
ps -p $PID
# PID TTY STAT TIME COMMAND
# 18342 ? Ss 0:00 sleep 1000
This asymmetry is by design: isolation goes one way. The container cannot see the host, but the host can see everything.
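The two PIDs are two names for the same process, and `/proc` exposes the mapping directly: the `NSpid` line in `/proc/<pid>/status` lists the process's PID in each nested PID namespace, outermost first (available since Linux 4.1). A quick check, assuming a container named `demo` is running (re-create it with `docker run -d --name demo alpine sleep 1000` if needed):

```shell
PID=$(docker inspect --format='{{.State.Pid}}' demo)
grep NSpid /proc/$PID/status
# NSpid:  18342   1    <- host PID first, then PID 1 inside the container
```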
Primitive 2: cgroups — "The Kernel Limits the Process"
Where namespaces isolate views, cgroups limit resource usage. Every container is placed in a cgroup, and Docker's -m / --memory, --cpus, --pids-limit, --blkio-weight flags translate directly to cgroup files on the host.
# Start a container with specific limits
docker run -d --name limited --memory=512m --cpus=1 alpine sleep 1000
# The cgroup
cat /proc/$(docker inspect --format='{{.State.Pid}}' limited)/cgroup
# 0::/system.slice/docker-<long-hash>.scope
# Memory limit on cgroup v2
cat /sys/fs/cgroup/system.slice/docker-*.scope/memory.max
# 536870912 <- 512 MiB in bytes
# CPU limit (quota period microseconds)
cat /sys/fs/cgroup/system.slice/docker-*.scope/cpu.max
# 100000 100000 <- 100ms per 100ms = 1 CPU
# Current usage
cat /sys/fs/cgroup/system.slice/docker-*.scope/memory.current
cat /sys/fs/cgroup/system.slice/docker-*.scope/cpu.stat
docker rm -f limited
The cgroup does three things at once:
- Accounts for resource usage (you can read `memory.current`, `cpu.stat`, etc.)
- Limits resource usage (processes OOM-kill or throttle when they exceed limits)
- Contains the process tree (every child of the container inherits membership)
docker stats is just a prettier version of reading these cgroup files.
docker run -d --name demo --memory=256m nginx
docker stats --no-stream demo
# CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
# abc123... demo 0.01% 12.3MiB / 256MiB 4.80% ... ... 5
docker rm -f demo
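You can verify that claim by computing the MEM column yourself. A sketch, assuming a cgroup v2 host where the scope path matches the one shown earlier:

```shell
docker run -d --name demo --memory=256m nginx
CID=$(docker inspect --format='{{.Id}}' demo)
CG=/sys/fs/cgroup/system.slice/docker-$CID.scope

# The same numbers docker stats prints, read straight from the kernel
cur=$(cat "$CG/memory.current")
max=$(cat "$CG/memory.max")
awk -v c="$cur" -v m="$max" \
    'BEGIN { printf "MEM %.1fMiB / %.0fMiB (%.2f%%)\n", c/2^20, m/2^20, 100*c/m }'

docker rm -f demo
```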
Resource limits are enforced by the kernel. When a container exceeds `--memory`, the kernel's OOM killer picks a process in that cgroup and kills it. If your container's main process is PID 1 and gets picked, the whole container dies. The OOM killer does not look at host memory pressure — it looks at the container's cgroup memory. This is why exit code 137 (128 + 9, the SIGKILL signal number) in `docker ps -a` almost always means "the container hit its memory limit."
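You can trigger this deliberately. A sketch: pipe `/dev/zero` into `tail`, which buffers unbounded newline-free input in memory until the cgroup limit bites (`--memory-swap` equal to `--memory` keeps the container from escaping into swap; exact behavior varies by host):

```shell
docker run --name oomtest --memory=64m --memory-swap=64m alpine \
    sh -c 'dd if=/dev/zero bs=1M count=256 | tail'
docker inspect --format='OOMKilled={{.State.OOMKilled}} ExitCode={{.State.ExitCode}}' oomtest
# Expect OOMKilled=true and ExitCode=137, i.e. 128 + 9 (SIGKILL)
docker rm oomtest
```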
Primitive 3: The Pivoted Rootfs — "The Container Sees a Different /"
The container's filesystem is not a copy of the image. It is an OverlayFS mount composed of:
- Lower layers (read-only): the image's layers as unpacked on the host.
- Upper layer (writable): a directory unique to this container for any changes it makes.
- Merged view: presented to the container as `/`.
When the container is created, Docker does the equivalent of:
1. `mkdir /var/lib/docker/overlay2/<container-id>/merged`
2. `mount -t overlay overlay -o lowerdir=...,upperdir=...,workdir=... /var/lib/docker/overlay2/<container-id>/merged`
3. `pivot_root` into the merged dir so the container sees it as `/`
4. (Old root is unmounted so the host FS is no longer visible from the container)
# Look at a running container's rootfs from the host
docker run -d --name demo alpine sleep 1000
docker inspect --format='{{.GraphDriver.Data.MergedDir}}' demo
# /var/lib/docker/overlay2/<long-hash>/merged
sudo ls /var/lib/docker/overlay2/<long-hash>/merged
# bin etc home lib mnt opt proc root run sbin srv sys tmp usr var
# This is exactly what the container sees as its /
# The container's changes (writes) go to the upper layer
sudo ls /var/lib/docker/overlay2/<long-hash>/diff
# (empty — nothing written yet)
docker exec demo touch /new-file
sudo ls /var/lib/docker/overlay2/<long-hash>/diff
# new-file <- the write landed in the upper layer
docker rm -f demo
OverlayFS is how containers get both "ship a small image" and "let the container write to files." Only writes create copies; reads of unchanged files come straight from the read-only layers. This is "copy-on-write" and it is the reason you can run 100 containers from one image without using 100× the disk space.
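The same copy-on-write mechanics can be reproduced without Docker, with a bare overlay mount. A sketch: run as root on a kernel with overlayfs, using throwaway directories under `/tmp`:

```shell
mkdir -p /tmp/ovl/lower /tmp/ovl/upper /tmp/ovl/work /tmp/ovl/merged
echo "from the image layer" > /tmp/ovl/lower/file.txt

mount -t overlay overlay \
    -o lowerdir=/tmp/ovl/lower,upperdir=/tmp/ovl/upper,workdir=/tmp/ovl/work \
    /tmp/ovl/merged

cat /tmp/ovl/merged/file.txt                             # served from the lower layer
echo "container wrote this" > /tmp/ovl/merged/file.txt   # write triggers copy-up
ls /tmp/ovl/upper                                        # file.txt: the private copy
cat /tmp/ovl/lower/file.txt                              # lower layer is untouched

umount /tmp/ovl/merged
```

The `lower`/`upper`/`work`/`merged` split here maps one-to-one onto the image layers, the container's `diff` directory, and the `merged` directory Docker showed above.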
Everything Plus Capabilities and Seccomp
Two bonus primitives that make containers safer by default:
Capabilities
Linux splits "root privilege" into dozens of capabilities (CAP_NET_BIND_SERVICE, CAP_SYS_ADMIN, CAP_CHOWN, etc.). Docker drops most of them for container processes so that even root inside the container has a limited set.
# See what caps a container has
docker run --rm alpine sh -c 'grep ^Cap /proc/self/status'
# CapInh: 00000000a80425fb
# CapPrm: 00000000a80425fb <- far fewer bits than host root
# CapEff: 00000000a80425fb
# Host root for comparison
grep ^Cap /proc/self/status # as root on the host
# CapInh: 0000000000000000
# CapPrm: 000001ffffffffff <- the full set (caps 0-40 on recent kernels)
# CapEff: 000001ffffffffff
# Decode the default Docker cap set (capsh is not in the Alpine base image;
# it ships in Alpine's libcap package)
docker run --rm alpine sh -c 'apk add -q libcap; capsh --print'
# Current: cap_chown,cap_dac_override,...,cap_setuid=eip (14 caps, not 40+)
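The cap set also moves per flag in both directions: `--cap-drop` removes a capability from the default set, and `--cap-add` grants one extra without reaching for `--privileged`. A sketch, assuming the default cap set shown above:

```shell
# CAP_CHOWN is in the default set, so root in a container can chown
docker run --rm alpine chown nobody /tmp

# Drop it and the same command fails, even though the user is still root
docker run --rm --cap-drop CHOWN alpine chown nobody /tmp
# chown: /tmp: Operation not permitted

# The safer inverse of --privileged: grant exactly one extra capability
docker run --rm alpine ip link set lo mtu 1400                      # fails: no CAP_NET_ADMIN
docker run --rm --cap-add NET_ADMIN alpine ip link set lo mtu 1400  # succeeds
```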
seccomp
A BPF (Berkeley Packet Filter) program applied to syscall entry. Docker ships a default seccomp profile that blocks roughly 60 syscalls (mount, reboot, kexec_load, ptrace of non-children, etc.). This is why `docker run alpine mount` fails: the profile blocks the mount syscall, and the container lacks CAP_SYS_ADMIN on top of that.
# Try to mount inside a default container (blocked)
docker run --rm alpine mount -t tmpfs tmpfs /mnt
# mount: permission denied
# Two layers block this: the seccomp profile and the dropped CAP_SYS_ADMIN.
# seccomp=unconfined alone is not enough, because the capability is still missing.
# Granting it lifts both, since the default profile allows mount once
# CAP_SYS_ADMIN is present (DO NOT DO THIS in production):
docker run --rm --cap-add SYS_ADMIN alpine mount -t tmpfs tmpfs /mnt
# (works, unless an AppArmor/SELinux policy on the host also denies mount)
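Between the default profile and unconfined there is a third option: pass your own profile with `--security-opt seccomp=<file>`. A minimal sketch in Docker's seccomp profile JSON format (blocking mkdir is an arbitrary choice for the demo):

```shell
cat > no-mkdir.json <<'EOF'
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    { "names": ["mkdir", "mkdirat"], "action": "SCMP_ACT_ERRNO" }
  ]
}
EOF

docker run --rm --security-opt seccomp=no-mkdir.json alpine mkdir /tmp/x
# mkdir fails with "Operation not permitted": SCMP_ACT_ERRNO returns EPERM
```

Real profiles are usually the inverse (default-deny with an allow list), which is how Docker's shipped profile works.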
The Full Picture: What docker run Does
What happens when you run `docker run -d nginx`
Every step is a standard Linux operation. Docker is the orchestrator.
Why These Three Primitives Are Enough
Namespaces + cgroups + a pivoted rootfs together give you:
- Isolation — the container sees its own processes, network, filesystem, hostname, IPC.
- Limits — the container cannot use more CPU, memory, or I/O than its cgroup allows.
- Packaging — the rootfs is whatever tarball you shipped (the image).
- Portability — the same image runs anywhere there is a Linux kernel that understands the same namespaces and cgroups.
- Speed — start-up is fork + exec + cgroup-attach + overlay-mount. Milliseconds, not seconds.
That is the whole container promise delivered with plain kernel features. No virtual machine monitor, no hardware emulation, no guest OS.
A team spent a week convinced Docker had a resource leak. Containers would accumulate memory on the host until OOMs struck. The team blamed the Docker daemon, upgraded it, rebooted nodes — nothing helped. Finally someone ran ls /sys/fs/cgroup/system.slice/ | wc -l and saw 40,000 orphaned cgroups from containers that had never been properly cleaned. Root cause: a kernel bug combined with a flag Docker had enabled for debugging. The fix was one line in /etc/docker/daemon.json. The lesson: when Docker "misbehaves," it is almost always the underlying primitive (cgroup, namespace, overlayfs, or kernel) that is misbehaving. Docker itself is a thin orchestrator — the real story is one layer down.
Key Concepts Summary
- A container is a recipe of five Linux features: new namespaces, a new cgroup, a pivoted rootfs, dropped capabilities, and a seccomp filter.
- Namespaces isolate what a process sees. Mount, PID, net, UTS, IPC, user, cgroup — Docker uses up to seven (user namespaces are opt-in).
- cgroups limit how much a process can use. Docker's `--memory`, `--cpus`, `--pids-limit` map directly to cgroup files.
- OverlayFS is how containers get copy-on-write filesystems. Read-only lower layers (the image) + writable upper layer (container changes) + merged view.
- Capabilities split root's power. Default Docker drops most caps, so even root inside a container has limited privileges.
- seccomp blocks dangerous syscalls. The default profile blocks ~60 syscalls; `--privileged` and `--security-opt seccomp=unconfined` turn it off.
- Docker is orchestration. The primitives are all kernel features. Replacing Docker with containerd, runc, or podman changes the CLI and daemon, not what containers are.
- You can debug containers with host tools. `nsenter`, `/proc/[pid]`, and `/sys/fs/cgroup/<scope>` let you inspect running containers without entering them.
Common Mistakes
- Assuming Docker implements isolation. It does not — the kernel does. Docker asks the kernel for namespaces and cgroups and wires them up.
- Running `--privileged` containers "for convenience." This disables seccomp, gives back all capabilities, and exposes `/dev` from the host. The container is effectively a host process.
- Confusing image layers with cgroups. Images are how your code is packaged (OverlayFS). cgroups are how your code is limited at runtime. Different primitives, different files on disk.
- Forgetting the container's PID on the host when debugging. `docker inspect --format='{{.State.Pid}}'` is faster than `docker exec` for reading proc-level info.
- Thinking `docker stats` is magic. It reads `/sys/fs/cgroup/<scope>/memory.current`, `cpu.stat`, `io.stat`, and formats them. You can bypass Docker entirely and read the files directly.
- Treating every runtime as "Docker." Kubernetes clusters often run containerd or CRI-O directly. The containers behave identically — same primitives — but the daemon and CLI differ.
- Skipping the Linux course's Module 5 because "Docker covers it." Docker uses namespaces and cgroups; understanding them first makes Docker obvious instead of mysterious.
A teammate runs `docker run --privileged ubuntu bash`, mounts a host directory from inside, and says 'Look, Docker can do anything!' What are the three specific isolation layers that `--privileged` disabled, and why does this matter for security?