Docker & Container Fundamentals

The Three Primitives — Namespaces, cgroups, chroot

A junior on the team asks: "What does Docker do? Like, if I ran docker run alpine sh, what is actually happening on the host?" The senior pauses, opens a terminal, and types three commands — unshare, pivot_root, echo PID > cgroup.procs. Thirty seconds later there is a new shell that thinks it is PID 1, sees a different filesystem, and is capped at one CPU. "That," the senior says, "is what Docker does. Those three things. Everything else is plumbing."

Every container runtime — Docker, containerd, runc, podman, Kubernetes' CRI implementations — is built on three Linux primitives: namespaces (isolation), cgroups (resource limits), and a pivoted rootfs (filesystem view). Once you have seen how they compose, "what is a container" goes from mystery to recipe. This lesson walks through all three, how they interlock, and how Docker invokes them for you when you type docker run.


The Recipe, in One Page

A container is:

  1. A process (spawned via clone())...
  2. ...placed in new namespaces (to isolate its view)...
  3. ...with its root filesystem pivoted (so its / is the image's rootfs, not the host's)...
  4. ...attached to a cgroup (to limit CPU, memory, I/O)...
  5. ...with capabilities dropped and optionally a seccomp filter (to limit what syscalls it can make).

That is it. Every container runtime does those five steps. The Linux kernel provides all of them directly — no virtualization, no emulation. This is why you could build a container by hand in Module 5 of the Linux course, and it is why Docker is "just" a user-friendly wrapper.
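
The recipe above can be sketched as a short script. This is a rough illustration under stated assumptions, not how any particular runtime implements it: ROOTFS and CG are hypothetical paths, chroot stands in for pivot_root to keep the sketch short, step 5 (capabilities and seccomp) is left out, and the host is assumed to run cgroup v2 with the cpu controller enabled for the parent cgroup. By default it only prints the plan; running it for real needs root and DRY_RUN=0.

```shell
#!/bin/sh
# Sketch of the container recipe by hand. DRY_RUN=1 (the default) prints the
# commands instead of running them.
# Assumptions (hypothetical): ROOTFS is an unpacked image rootfs, CG is a
# cgroup v2 directory, and util-linux's unshare is installed.
ROOTFS=${ROOTFS:-/tmp/demo-rootfs}
CG=${CG:-/sys/fs/cgroup/demo}
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

container_by_hand() {
  # Step 4: create a cgroup and cap it at one CPU (100ms of every 100ms)
  run mkdir -p "$CG"
  run sh -c "echo '100000 100000' > '$CG/cpu.max'"
  # Move this shell into the cgroup; every child inherits membership
  run sh -c "echo $$ > '$CG/cgroup.procs'"
  # Steps 1-3: a new process in new namespaces with a new root
  run unshare --mount --uts --ipc --net --pid --fork chroot "$ROOTFS" /bin/sh
}

container_by_hand
```

The cgroup is set up first so that the process started by unshare is already limited when it begins executing.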

KEY CONCEPT

Docker did not invent containers. LXC had most of this before Docker existed; FreeBSD jails had a subset in 2000; Solaris zones had it in 2004. What Docker invented was a friendly CLI (docker run), a standard image format, and a registry for distribution. The isolation primitives are Linux kernel features, not Docker features — which is why Docker can be replaced with containerd, runc, or podman without changing how containers behave.


Primitive 1: Namespaces — "The Kernel Lies to the Process"

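Namespaces isolate what a process can see. There are eight kinds (covered in full in the Linux course, Module 5), and Docker works with the seven below. Most are created fresh for every container; the user namespace is opt-in, and whether the cgroup namespace is private depends on the host's cgroup version: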

Namespace   What it isolates                       What the container sees vs the host
mnt         Mount table                            Own /, /etc, /var (the image's rootfs)
pid         PID number space                       Container's main process is PID 1
net         Network interfaces, routes, sockets    Own eth0 (or none), own routing table
uts         Hostname + NIS domain                  Container has its own hostname
ipc         System V IPC + POSIX message queues    No shared memory with host processes
user        UID/GID mappings (optional)            Container's root can be unprivileged on host
cgroup      View into /sys/fs/cgroup               Cgroup paths rooted at the container's cgroup

Every docker run creates a new set of these and puts the container's process in them.

See namespaces at work

# Start a container
docker run -d --name demo alpine sleep 1000
PID=$(docker inspect --format='{{.State.Pid}}' demo)
echo "Container PID on host: $PID"

# Inspect the namespaces the container is in (the links are root-owned, hence sudo)
sudo ls -l /proc/$PID/ns | head
# lrwxrwxrwx 1 root root 0 ... mnt -> 'mnt:[4026532421]'
# lrwxrwxrwx 1 root root 0 ... net -> 'net:[4026532422]'
# lrwxrwxrwx 1 root root 0 ... pid -> 'pid:[4026532423]'
# ...
# Each inode number is a distinct namespace.

# Compare with host PID 1's namespaces (systemd on most systems)
diff <(sudo ls -l /proc/1/ns | awk '{print $9, $11}') \
     <(sudo ls -l /proc/$PID/ns | awk '{print $9, $11}')
# Differs on: mnt, net, pid, uts, ipc (and cgroup or user, depending on configuration)
# Same on:    typically user (unless user-namespace remapping is enabled) and time

# Hop into the container's namespaces with nsenter — no docker exec needed
sudo nsenter -t $PID -n ip addr
# eth0@if8 ... 172.17.0.2/16  <- the container's network view

sudo nsenter -t $PID -m ls /
# bin  etc  lib  usr  ...      <- the container's filesystem view

docker rm -f demo

PRO TIP

nsenter -t <host-pid> -n <cmd> is the single most useful command for debugging container networking from the host. You get host-side tools (ip, ss, tcpdump, ping) with the container's network view. No need to install anything inside the minimal container image — you already have everything on the host.

PID 1 inside the container

The container's main process sees itself as PID 1 because it is in a new PID namespace. From the host, the same process has a different PID (e.g., 18342). From inside the container, /proc shows only the processes in the container's PID namespace.

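# Recreate the demo container (the previous example removed it)
docker run -d --name demo alpine sleep 1000
PID=$(docker inspect --format='{{.State.Pid}}' demo)

# From inside a running container
docker exec demo ps -ef
# PID   USER     COMMAND
# 1     root     sleep 1000
# 7     root     ps -ef
# Nothing from the host is visible.

# But from the host, the container's processes are right there
ps -p $PID
# PID  TTY  STAT  TIME  COMMAND
# 18342 ?   Ss   0:00  sleep 1000

docker rm -f demo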

This asymmetry is by design: isolation goes one way. The container cannot see the host, but the host can see everything.
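
The two PIDs of a single process can be read side by side from the host. On kernels 4.1 and later, /proc/<pid>/status carries an NSpid line that lists the process's PID in each nested PID namespace, outermost first. A minimal sketch; the function name nspid_fields and the sample values are illustrative, not part of any tool:

```shell
# Print "<outermost pid> <innermost pid>" from a /proc/<pid>/status stream on stdin.
# The last NSpid field is the PID as the process sees itself.
nspid_fields() {
  awk '/^NSpid:/ { print $2, $NF }'
}

# Sample input in the shape a containerized "sleep 1000" would show on the host
printf 'Name:\tsleep\nNSpid:\t18342\t1\n' | nspid_fields
# prints: 18342 1
```

Against a live container this would be called as nspid_fields < /proc/$PID/status, with $PID taken from docker inspect as in the earlier examples.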


Primitive 2: cgroups — "The Kernel Limits the Process"

Where namespaces isolate views, cgroups limit resource usage. Every container is placed in a cgroup, and Docker's -m / --memory, --cpus, --pids-limit, --blkio-weight flags translate directly to cgroup files on the host.

# Start a container with specific limits
docker run -d --name limited --memory=512m --cpus=1 alpine sleep 1000

# The cgroup (cgroup v2; the path below assumes the systemd cgroup driver)
LPID=$(docker inspect --format='{{.State.Pid}}' limited)
cat /proc/$LPID/cgroup
# 0::/system.slice/docker-<long-hash>.scope

# Build this container's cgroup directory from that line. A docker-*.scope
# glob would match every running container, not just this one.
CG=/sys/fs/cgroup$(cut -d: -f3 /proc/$LPID/cgroup)

# Memory limit on cgroup v2
cat $CG/memory.max
# 536870912       <- 512 MiB in bytes

# CPU limit: "<quota> <period>", both in microseconds
cat $CG/cpu.max
# 100000 100000   <- 100ms per 100ms = 1 CPU

# Current usage
cat $CG/memory.current
cat $CG/cpu.stat

docker rm -f limited

The cgroup does three things at once:

  1. Accounts for resource usage (you can read memory.current, cpu.stat, etc.)
  2. Limits resource usage (processes are OOM-killed or throttled when they exceed limits)
  3. Contains the process tree (every child of the container inherits membership)
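
The raw values in cpu.max and memory.max are easier to reason about once converted back into the units of the Docker flags. A small sketch of that arithmetic; the function names are hypothetical helpers, and the inputs are the file contents shown above:

```shell
# cpu.max holds "<quota> <period>" in microseconds, or "max <period>" when unlimited.
cpu_max_to_cpus() {
  echo "$1" | awk '$1 == "max" { print "unlimited"; next } { printf "%.2f\n", $1 / $2 }'
}

# memory.max holds a byte count, or "max" when unlimited.
memory_max_to_mib() {
  if [ "$1" = max ]; then echo unlimited; else echo "$(( $1 / 1048576 ))"; fi
}

cpu_max_to_cpus "100000 100000"   # prints 1.00
cpu_max_to_cpus "150000 100000"   # prints 1.50, the shape --cpus=1.5 produces
cpu_max_to_cpus "max 100000"      # prints unlimited
memory_max_to_mib 536870912       # prints 512
```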

docker stats is just a prettier version of reading these cgroup files.

docker run -d --name demo --memory=256m nginx
docker stats --no-stream demo
# CONTAINER ID   NAME   CPU %   MEM USAGE / LIMIT   MEM %   NET I/O  BLOCK I/O   PIDS
# abc123...     demo   0.01%   12.3MiB / 256MiB    4.80%   ...      ...         5

docker rm -f demo

WARNING

Resource limits are enforced by the kernel. When a container exceeds --memory, the kernel's OOM killer picks a process in that cgroup and kills it. If the victim is the container's main process (PID 1 inside the container), the whole container dies. A cgroup OOM kill is triggered by the cgroup's own memory limit, not by host memory pressure. This is why exit code 137 (128 + 9, SIGKILL) in docker ps -a often means "the container hit its memory limit." It is not proof on its own, since docker kill and a timed-out docker stop also send SIGKILL; docker inspect --format='{{.State.OOMKilled}}' <container> confirms whether the OOM killer was responsible.
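
Exit codes above 128 encode the signal that ended the process: the code is 128 plus the signal number. A short sketch of the decoding, with a hypothetical helper name:

```shell
# Map an exit code to the signal that ended the process, if any.
exit_code_signal() {
  if [ "$1" -gt 128 ]; then
    echo "signal $(( $1 - 128 ))"
  else
    echo "exit status $1"
  fi
}

exit_code_signal 137   # prints "signal 9", which is SIGKILL
exit_code_signal 143   # prints "signal 15", which is SIGTERM
exit_code_signal 1     # prints "exit status 1"
```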


Primitive 3: The Pivoted Rootfs — "The Container Sees a Different /"

The container's filesystem is not a copy of the image. It is an OverlayFS mount composed of:

  • Lower layers (read-only): the image's layers as unpacked on the host.
  • Upper layer (writable): a directory unique to this container for any changes it makes.
  • Merged view: presented to the container as /.

When the container is created, Docker does the equivalent of:

  1. mkdir /var/lib/docker/overlay2/<container-id>/merged
  2. mount -t overlay overlay -o lowerdir=...,upperdir=...,workdir=... /var/lib/docker/overlay2/<container-id>/merged
  3. pivot_root into the merged dir so the container sees it as /
  4. (Old root is unmounted so the host FS is no longer visible from the container)

# Look at a running container's rootfs from the host
docker run -d --name demo alpine sleep 1000
MERGED=$(docker inspect --format='{{.GraphDriver.Data.MergedDir}}' demo)
echo "$MERGED"
# /var/lib/docker/overlay2/<long-hash>/merged

sudo ls "$MERGED"
# bin  etc  home  lib  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var

# This is exactly what the container sees as its /
# The container's changes (writes) go to the upper layer
UPPER=$(docker inspect --format='{{.GraphDriver.Data.UpperDir}}' demo)
sudo ls "$UPPER"
# (empty: nothing written yet)

docker exec demo touch /new-file
sudo ls "$UPPER"
# new-file                          <- the write landed in the upper layer

docker rm -f demo

KEY CONCEPT

OverlayFS is how containers get both "ship a small image" and "let the container write to files." Only writes create copies; reads of unchanged files come straight from the read-only layers. This is "copy-on-write" and it is the reason you can run 100 containers from one image without using 100× the disk space.
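
The lookup order and the copy-up step can be imitated with ordinary directories. The sketch below is a userspace simulation, not a real overlay mount: it ignores whiteouts, directories, and metadata, and the function names are made up for illustration. It only shows why reads fall through to the image while writes never touch it.

```shell
# Name lookup: the first layer that has the path wins.
# Pass the upper dir first, then the lower layers in order.
overlay_lookup() {
  path=$1; shift
  for layer in "$@"; do
    if [ -e "$layer/$path" ]; then
      echo "$layer/$path"
      return 0
    fi
  done
  return 1
}

# Copy-on-write: append stdin to a file, copying it up from a lower layer first.
overlay_write() {
  path=$1; upper=$2; shift 2
  mkdir -p "$(dirname "$upper/$path")"
  if [ ! -e "$upper/$path" ] && src=$(overlay_lookup "$path" "$@"); then
    cp "$src" "$upper/$path"       # copy-up from the lower layer
  fi
  cat >> "$upper/$path"            # the write itself lands in the upper layer
}

tmp=$(mktemp -d)
mkdir -p "$tmp/lower/etc" "$tmp/upper"
echo "from image" > "$tmp/lower/etc/motd"

overlay_lookup etc/motd "$tmp/upper" "$tmp/lower"   # resolves to the lower layer
echo "from container" | overlay_write etc/motd "$tmp/upper" "$tmp/lower"
overlay_lookup etc/motd "$tmp/upper" "$tmp/lower"   # now resolves to the upper layer
cat "$tmp/lower/etc/motd"                           # the image layer is unchanged
```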


Everything Plus Capabilities and Seccomp

Two bonus primitives that make containers safer by default:

Capabilities

Linux splits "root privilege" into dozens of capabilities (CAP_NET_BIND_SERVICE, CAP_SYS_ADMIN, CAP_CHOWN, etc.). Docker drops most of them for container processes so that even root inside the container has a limited set.

# See what caps a container has
docker run --rm alpine sh -c 'grep ^Cap /proc/self/status'
# CapInh: 00000000a80425fb
# CapPrm: 00000000a80425fb   <- far fewer bits than host root
# CapEff: 00000000a80425fb

# Host root for comparison
grep ^Cap /proc/self/status     # as root on the host
# CapInh: 0000000000000000
# CapPrm: 000001ffffffffff       <- all 41 caps on recent kernels (bits 0-40)
# CapEff: 000001ffffffffff

# Decode the default Docker cap set. The stock alpine image does not ship capsh,
# so decode the mask on the host (needs the libcap tools installed).
capsh --decode=00000000a80425fb
# 0x00000000a80425fb=cap_chown,cap_dac_override,...,cap_setfcap  (14 caps, not 41)
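
The hex masks in /proc/<pid>/status are plain bitmaps, one bit per capability, so they can also be decoded without capsh. A sketch in POSIX shell; the bit-to-name order follows linux/capability.h for bits 0 through 40, and decode_caps is a hypothetical helper name:

```shell
# Capability names by bit number (bit 0 first)
CAP_NAMES="chown dac_override dac_read_search fowner fsetid kill setgid setuid \
setpcap linux_immutable net_bind_service net_broadcast net_admin net_raw \
ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct \
sys_admin sys_boot sys_nice sys_resource sys_time sys_tty_config mknod lease \
audit_write audit_control setfcap mac_override mac_admin syslog wake_alarm \
block_suspend audit_read perfmon bpf checkpoint_restore"

# Print the capability names whose bits are set in a hex mask (no 0x prefix).
decode_caps() {
  mask=$(( 0x$1 ))
  bit=0
  for name in $CAP_NAMES; do
    if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
      echo "cap_$name"
    fi
    bit=$(( bit + 1 ))
  done
}

decode_caps 00000000a80425fb | wc -l     # 14 names: the container mask shown above
decode_caps 000001ffffffffff | wc -l     # 41 names: the host root mask shown above
```

Running decode_caps 00000000a80425fb without the wc -l lists the names themselves, which makes it easy to confirm that cap_sys_admin is not among them.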

seccomp

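seccomp applies a BPF (Berkeley Packet Filter) program to every syscall a process makes. Docker ships a default profile that works as an allowlist: common syscalls are permitted and the rest are denied, which Docker's documentation puts at around 44 of the 300+ syscalls (reboot, kexec_load, and similar). Some syscalls, mount among them, are allowed only when the container also holds the matching capability (CAP_SYS_ADMIN for mount). This is why docker run alpine mount fails: even as root inside the container, the dropped capability and the seccomp profile both stand in the way.

# Try to mount inside a default container (blocked)
docker run --rm alpine mount -t tmpfs tmpfs /mnt
# mount: /mnt: permission denied.

# Disabling seccomp alone is not enough: CAP_SYS_ADMIN is still dropped
docker run --rm --security-opt seccomp=unconfined alpine mount -t tmpfs tmpfs /mnt
# (still denied)

# Granting the capability (DO NOT DO THIS in production)
docker run --rm --cap-add SYS_ADMIN alpine mount -t tmpfs tmpfs /mnt
# (works; on hosts with AppArmor you may also need --security-opt apparmor=unconfined)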
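
Whether a process is running under a seccomp filter is visible in the Seccomp field of /proc/<pid>/status: 0 means disabled, 1 strict mode, 2 filter mode. A small sketch that translates the value; seccomp_mode is a hypothetical helper name:

```shell
# Translate the Seccomp field of /proc/<pid>/status into a mode name.
seccomp_mode() {
  case "$1" in
    0) echo "disabled" ;;
    1) echo "strict" ;;
    2) echo "filter" ;;
    *) echo "unknown" ;;
  esac
}

# Report the current shell's mode when the kernel exposes the field
mode=$(awk '/^Seccomp:/ { print $2 }' /proc/self/status 2>/dev/null || true)
echo "this shell: $(seccomp_mode "${mode:-unknown}")"

seccomp_mode 2   # prints "filter", the mode a default Docker container reports
```

Inside a container, docker run --rm alpine grep Seccomp /proc/self/status shows the same field directly.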

The Full Picture: What docker run Does

[Interactive diagram: what happens when you run `docker run -d nginx`, step by step]

Every step is a standard Linux operation. Docker is the orchestrator.


Why These Three Primitives Are Enough

Namespaces + cgroups + a pivoted rootfs together give you:

  • Isolation — the container sees its own processes, network, filesystem, hostname, IPC.
  • Limits — the container cannot use more CPU, memory, or I/O than its cgroup allows.
  • Packaging — the rootfs is whatever tarball you shipped (the image).
  • Portability — the same image runs anywhere there is a Linux kernel that understands the same namespaces and cgroups.
  • Speed — start-up is fork + exec + cgroup-attach + overlay-mount. Milliseconds, not seconds.

That is the whole container promise delivered with plain kernel features. No virtual machine monitor, no hardware emulation, no guest OS.

WAR STORY

A team spent a week convinced Docker had a resource leak. Containers would accumulate memory on the host until OOMs struck. The team blamed the Docker daemon, upgraded it, rebooted nodes — nothing helped. Finally someone ran ls /sys/fs/cgroup/system.slice/ | wc -l and saw 40,000 orphaned cgroups from containers that had never been properly cleaned. Root cause: a kernel bug combined with a flag Docker had enabled for debugging. The fix was one line in /etc/docker/daemon.json. The lesson: when Docker "misbehaves," it is almost always the underlying primitive (cgroup, namespace, overlayfs, or kernel) that is misbehaving. Docker itself is a thin orchestrator — the real story is one layer down.


Key Concepts Summary

  • A container is a recipe of five Linux features: new namespaces, a new cgroup, a pivoted rootfs, dropped capabilities, and a seccomp filter.
  • Namespaces isolate what a process sees. Mount, PID, net, UTS, IPC, user, cgroup: Docker works with all seven, though the user namespace is opt-in.
  • cgroups limit how much a process can use. Docker's --memory, --cpus, --pids-limit map directly to cgroup files.
  • OverlayFS is how containers get copy-on-write filesystems. Read-only lower layers (the image) + writable upper layer (container changes) + merged view.
  • Capabilities split root's power. Default Docker drops most caps, so even root inside a container has limited privileges.
  • seccomp blocks dangerous syscalls. The default profile is an allowlist that denies around 44 of the 300+ syscalls; --privileged and --security-opt seccomp=unconfined turn it off.
  • Docker is orchestration. The primitives are all kernel features. Replacing Docker with containerd, runc, or podman changes the CLI and daemon, not what containers are.
  • You can debug containers with host tools. nsenter, /proc/[pid], /sys/fs/cgroup/<scope> let you inspect running containers without entering them.

Common Mistakes

  • Assuming Docker implements isolation. It does not — the kernel does. Docker asks the kernel for namespaces and cgroups and wires them up.
  • Running --privileged containers "for convenience." This disables seccomp, gives back all capabilities, and exposes /dev from the host. The container is effectively a host process.
  • Confusing image layers with cgroups. Images are how your code is packaged (OverlayFS). cgroups are how your code is limited at runtime. Different primitives, different files on disk.
  • Forgetting the container's PID on the host when debugging. docker inspect --format='{{.State.Pid}}' is faster than docker exec for reading proc-level info.
  • Thinking docker stats is magic. It reads /sys/fs/cgroup/<scope>/memory.current, cpu.stat, io.stat, and formats them. You can bypass Docker entirely and read the files directly.
  • Treating every runtime as "Docker." Kubernetes clusters often run containerd or CRI-O directly. The containers behave identically — same primitives — but the daemon and CLI differ.
  • Skipping the Linux course's Module 5 because "Docker covers it." Docker uses namespaces and cgroups; understanding them first makes Docker obvious instead of mysterious.

KNOWLEDGE CHECK

A teammate runs `docker run --privileged ubuntu bash`, mounts a host directory from inside, and says 'Look, Docker can do anything!' What are the three specific isolation layers that `--privileged` disabled, and why does this matter for security?