Putting It Together: A Container From Scratch
After someone has been using Docker for a year, the question "what is a container, actually?" tends to produce an answer about images, registries, layers, Dockerfiles, and runtime APIs. None of that is what a container is. A container is a handful of Linux primitives — namespaces, cgroups, a pivoted root, some capabilities dropped, maybe a seccomp filter — arranged into a structure. The image layers and registries are how we ship it. The runtime API is how we automate it. But the thing itself is plain Linux.
The point of this lesson is to build a working container from a shell, using nothing but tools that have shipped in coreutils and util-linux for years. You will see a process think it is PID 1, see its own filesystem, have its own network, and be capped at 1 CPU and 256 MiB of memory — without any container runtime running on the host. When you are done, every Docker command you read will translate back to what you did by hand, and debugging containers at the systems level will become straightforward.
The Recipe
To create a container, you combine these primitives:
- A root filesystem — a directory containing a Linux userspace (busybox, Alpine, Ubuntu, whatever).
- New namespaces — `mnt`, `pid`, `net`, `uts`, `ipc`, optionally `user` and `cgroup`.
- A cgroup — with CPU, memory, and PIDs limits.
- `pivot_root` (or `chroot`) — to make the new rootfs the process's `/`.
- Mount the right things inside — `/proc` for the new PID namespace, `/sys`, `/dev`, maybe `/tmp`.
- Drop capabilities and apply seccomp — the security story (we will skim this in favor of the mechanics).
- Execute the target program.
Every container runtime — Docker, containerd, runc, podman — does these steps for you. We are going to do them ourselves.
A container is not a virtual machine. It is a process that has been set up with a curated view of its surroundings. After this lesson you will have built one by hand, and "container" will mean something concrete to you: a process, several namespace inode numbers, a cgroup path, and a pivoted rootfs. No image required, no runtime required — the kernel supplies everything.
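You can inspect the namespaces any process occupies right now, container or not, with nothing but `/proc`. A minimal sketch:

```shell
# Every process's namespaces are visible as magic symlinks in /proc/<pid>/ns.
# Two processes are in the same namespace exactly when the inode numbers
# in the symlink targets (e.g. "pid:[4026531836]") match.
for ns in mnt pid net uts ipc; do
    printf '%-4s %s\n' "$ns" "$(readlink /proc/self/ns/$ns)"
done
```

When you finish the lab, run the same loop inside and outside the container; the inode numbers will differ for every namespace you unshared.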
Step 1: Get a Root Filesystem
The easiest starting point is a prebuilt minimal filesystem. Busybox or Alpine rootfs tarballs are tiny (~5–10 MB) and contain enough userspace to be useful.
# Option A: debootstrap a small Ubuntu (needs debootstrap package)
sudo mkdir -p /srv/containerlab/rootfs
sudo debootstrap --variant=minbase jammy /srv/containerlab/rootfs
# Option B: extract an Alpine minirootfs tarball
mkdir -p ~/containerlab/rootfs
curl -L https://dl-cdn.alpinelinux.org/alpine/v3.19/releases/x86_64/alpine-minirootfs-3.19.1-x86_64.tar.gz \
| tar -xz -C ~/containerlab/rootfs
# Option C: steal one from a running container (easiest if you already have docker)
docker export $(docker create alpine:3.19) | tar -xf - -C ~/containerlab/rootfs
# Verify
ls ~/containerlab/rootfs
# bin dev etc home lib media mnt opt proc root run sbin srv sys tmp usr var
This directory now contains everything a minimal Linux userspace needs — coreutils, shell, libc, and so on. What it is missing is a kernel. There is none inside; the container will share the host's.
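Before going further, it is worth sanity-checking the directory. Here is a small helper sketch (`check_rootfs` is our own name, not a standard tool) that verifies the minimum a shell will need:

```shell
# check_rootfs: rough sanity check for a minimal rootfs (hypothetical helper).
# A usable rootfs needs at least a shell and an /etc directory; dynamically
# linked rootfses also need their libc under /lib or /usr/lib.
check_rootfs() {
    root=$1
    [ -d "$root" ]        || { echo "missing: $root"; return 1; }
    [ -e "$root/bin/sh" ] || { echo "missing: $root/bin/sh"; return 1; }
    [ -d "$root/etc" ]    || { echo "missing: $root/etc"; return 1; }
    echo "rootfs at $root looks plausible"
}

check_rootfs ~/containerlab/rootfs
```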
Step 2: Create a cgroup and Set Limits
# Create a cgroup under the v2 hierarchy
sudo mkdir /sys/fs/cgroup/mycontainer
# Enable controllers in the parent so children get them
echo "+cpu +memory +pids" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
# Apply limits
echo "100000 100000" | sudo tee /sys/fs/cgroup/mycontainer/cpu.max # 1 CPU
echo $((256 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/mycontainer/memory.max # 256 MiB
echo 128 | sudo tee /sys/fs/cgroup/mycontainer/pids.max # 128 tasks
# Verify
cat /sys/fs/cgroup/mycontainer/cpu.max
# 100000 100000
cat /sys/fs/cgroup/mycontainer/memory.max
# 268435456
These limits will apply to every process we add to /sys/fs/cgroup/mycontainer/cgroup.procs.
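The two numbers in `cpu.max` are a quota and a period, both in microseconds; the ratio quota/period is the number of CPUs. A quick sketch of the arithmetic (plain shell, no root needed):

```shell
# cpu.max is "<quota> <period>": the cgroup may consume <quota> µs of CPU
# time per <period> µs of wall clock. quota == period means 1 full CPU.
quota=100000; period=100000
echo "CPUs allowed: $((quota / period))"   # 1

# Half a CPU would be "50000 100000"; two CPUs "200000 100000".

# memory.max takes plain bytes, which is why 256 MiB reads back as:
echo $((256 * 1024 * 1024))                # 268435456
```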
Step 3: Spawn a Shell in New Namespaces
`unshare` is the tool for creating namespaces on the fly. We want all the usual container-style namespaces:
# Create mnt, uts, ipc, pid, net, cgroup namespaces; run bash inside;
# --fork because unshare itself stays in the old PID namespace — the new
# namespace applies only to children, so a forked child of unshare must
# be the one that becomes PID 1
sudo unshare \
--mount --uts --ipc --pid --net --cgroup \
--fork --mount-proc \
bash
After this, you are in a new shell with brand-new namespaces. Thanks to --mount-proc, ps already shows your shell as PID 1, but the rootfs is still the host's, so little else has changed visually. Let us fix that next.
Step 4: Pivot to the Container Rootfs
`pivot_root` is the supported way to switch a process's root to a different filesystem. It requires the new root to be its own mount, so we start with a bind-mount of the rootfs onto itself.
From inside the unshared shell:
# Variables — point this at the rootfs you prepared in step 1
# (note: under sudo, ~ is root's home; adjust if you unpacked elsewhere)
ROOT=/root/containerlab/rootfs
cd $ROOT
# Make a mount namespace's view of the rootfs — bind to itself so it becomes a mount
mount --bind . .
# Prepare a place for the old root (pivot_root needs it)
mkdir -p .old_root
# Perform the pivot: new root = $ROOT, old root mounts at $ROOT/.old_root
pivot_root . .old_root
# Move into the new root
cd /
# Unmount the old root (we do not want the host visible from inside)
umount -l /.old_root
rmdir /.old_root
Now / is the container's rootfs. ls / should look like the Alpine or Ubuntu userspace you laid out in step 1, not the host's.
Mount the pseudo filesystems inside
# --mount-proc mounted /proc on the *old* root, which we just unmounted,
# so mount a fresh proc here. Add /sys and make sure /dev is reasonable:
mount -t proc proc /proc
mount -t sysfs sys /sys 2>/dev/null || true
mount -t tmpfs -o nosuid,size=64m tmpfs /tmp
mount -t devtmpfs dev /dev 2>/dev/null || mount -t tmpfs none /dev
# Verify
mount | head -8
findmnt / -n
Step 5: Put Our Shell Into the cgroup
Open a second host terminal (do not exit the container shell — as PID 1, its death tears down the namespace) and write the unshared shell's PID into the cgroup:
# On the host (second terminal)
# The container shell is the child that `unshare --fork` created
UNSHARE_PID=$(pgrep -xo unshare)
PID=$(pgrep -P "$UNSHARE_PID")
echo $PID
# Move it into the cgroup
echo $PID | sudo tee /sys/fs/cgroup/mycontainer/cgroup.procs
From now on, every command run in that shell — and every child it forks — is constrained to 1 CPU and 256 MiB of memory.
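You can confirm the move took effect by reading the process's cgroup membership out of `/proc`. This works for any process, container or not:

```shell
# /proc/<pid>/cgroup shows which cgroup a process belongs to.
# On a pure cgroup v2 host it is a single line: 0::<path under /sys/fs/cgroup>.
cat /proc/self/cgroup

# For the container shell (run on the host, with $PID from above):
# cat /proc/$PID/cgroup
# expected: 0::/mycontainer, confirming the move
```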
Step 6: Confirm You Are in a Container
Back in the unshared, pivoted, cgroup-bounded shell:
# Our "container" has its own process view
ps aux
# PID USER COMMAND
# 1 root bash
# 14 root ps aux
# That's it — PID 1 is our shell. No host processes visible.
# Our own hostname
hostname
# Some default — try setting it
hostname my-container
hostname
# my-container
# Our own network — nothing plugged in (we didn't configure any interfaces)
ip link
# 1: lo: LOOPBACK mtu 65536 qdisc noop state DOWN ...
# Only lo, not up.
# Our own mount view
mount | head
# / is our rootfs; /proc, /sys, /tmp are fresh inside here.
# Confirm the cgroup limit (from the host terminal — we have not mounted
# the cgroup2 filesystem inside the container)
cat /sys/fs/cgroup/mycontainer/memory.max
# 268435456
# Try to exceed the memory limit. tail buffers its whole input in memory,
# and busybox provides it, so this works even in a minimal rootfs:
head -c 300m /dev/zero | tail
# Killed <- OOM-killed by the cgroup
At this point, you have: a process that thinks it is PID 1, cannot see any host processes, has its own hostname, its own mount view, its own network namespace, and is bounded by a cgroup. That is a container. Nothing from Docker is involved.
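A more direct proof than eyeballing ps: compare namespace inodes. If you saved the container PID on the host, the `/proc` symlinks differ for every namespace you unshared. A sketch (`same_ns` is our own helper; comparing a process against itself, as below, demonstrates the "same namespace" case):

```shell
# same_ns: report whether two PIDs share a given namespace.
# Namespace identity is just symlink-target equality in /proc.
same_ns() {
    a=$(readlink "/proc/$1/ns/$3")
    b=$(readlink "/proc/$2/ns/$3")
    [ "$a" = "$b" ] && echo "same $3 namespace" || echo "different $3 namespace"
}

same_ns $$ $$ uts        # same uts namespace (trivially: same process)
# same_ns $$ $PID pid    # host shell vs container shell: different
```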
Step 7: Add Networking (the Hard Part)
Our container has a network namespace, but it is empty except for loopback. To give it real connectivity we need to create a veth pair — a pair of virtual interfaces where traffic on one pops out the other.
Here is the outline. Do this with two shells, or save the PID of the unshared bash and operate on the namespace by path.
# On the host (first shell):
# Find the container shell PID (we will call it $CPID)
CPID=$PID
# Expose the container's netns as /var/run/netns/mycontainer so `ip netns` can use it
sudo mkdir -p /var/run/netns
sudo ln -sf /proc/$CPID/ns/net /var/run/netns/mycontainer
# Create veth pair: veth0 on host, veth1 goes into the namespace
sudo ip link add veth0 type veth peer name veth1
sudo ip link set veth1 netns mycontainer
# Configure the host side
sudo ip addr add 10.200.0.1/24 dev veth0
sudo ip link set veth0 up
# Configure the container side (from inside the ns, using nsenter or ip netns exec)
sudo ip netns exec mycontainer ip link set lo up
sudo ip netns exec mycontainer ip addr add 10.200.0.2/24 dev veth1
sudo ip netns exec mycontainer ip link set veth1 up
sudo ip netns exec mycontainer ip route add default via 10.200.0.1
# Enable IP forwarding + NAT on the host so container can reach the outside world
sudo sysctl -w net.ipv4.ip_forward=1
sudo iptables -t nat -A POSTROUTING -s 10.200.0.0/24 ! -o veth0 -j MASQUERADE
# Test from inside the container
# (back in the unshared shell)
ping -c1 10.200.0.1 # host side of veth — should work
ping -c1 1.1.1.1 # out to the internet — should work thanks to NAT
This is, at a low level, exactly what the Docker bridge network does: a bridge on the host, a veth pair for each container, NAT for outbound traffic, port-forwarding rules for inbound.
Every time you see docker0, a veth* interface on the host, or a Kubernetes CNI plugin like flannel, calico, or cilium — they are automating the veth-pair-and-NAT choreography you just did by hand. Building this once is the difference between "Docker's network is magic" and "Docker is a wrapper around twelve ip commands."
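Because the choreography is easy to get wrong, it helps to template it. Here is a hedged sketch that only prints the commands for a given container PID and subnet (`setup_container_net` and the `veth$pid`/`ceth$pid` naming are our own conventions, not a standard tool):

```shell
# setup_container_net: emit (not run) the veth/NAT commands for one container.
# Printing first lets you review the plan; pipe the output to `sudo sh` to apply.
setup_container_net() {
    pid=$1 host_ip=$2 ctr_ip=$3 subnet=$4
    cat <<EOF
ln -sf /proc/$pid/ns/net /var/run/netns/ctr$pid
ip link add veth$pid type veth peer name ceth$pid
ip link set ceth$pid netns ctr$pid
ip addr add $host_ip dev veth$pid
ip link set veth$pid up
ip netns exec ctr$pid ip link set lo up
ip netns exec ctr$pid ip addr add $ctr_ip dev ceth$pid
ip netns exec ctr$pid ip link set ceth$pid up
ip netns exec ctr$pid ip route add default via ${host_ip%/*}
iptables -t nat -A POSTROUTING -s $subnet ! -o veth$pid -j MASQUERADE
EOF
}

setup_container_net 1234 10.200.0.1/24 10.200.0.2/24 10.200.0.0/24
```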
Step 8: Drop Capabilities and Apply seccomp (Optional)
Our hand-built container is a privileged container: inside the namespace, the shell is UID 0 with full capabilities. A real container runtime would drop most capabilities and apply a seccomp filter. You can do this too — but it requires a helper binary (like capsh or a small C program) because the capability drop has to happen between clone() and execve().
The shortcut: use setpriv to run a command with dropped capabilities:
# Instead of bash, run our target with reduced privileges
setpriv --clear-groups --no-new-privs \
--inh-caps=-all --bounding-set=-all --ambient-caps=-all \
bash
Real runtimes also apply a seccomp BPF filter blocking dangerous syscalls (mount, reboot, kexec_load, many more). Docker's default seccomp profile is a good reference for which syscalls to allow — but writing a filter from scratch is beyond the scope of this lesson.
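You can inspect a process's current capability sets without extra tooling: they are hex bitmasks in `/proc/<pid>/status`. A sketch:

```shell
# CapEff is the effective capability set as a 64-bit hex mask.
# All zeros means no capabilities; full root typically shows a large mask.
grep '^Cap' /proc/self/status

# Decode a mask into capability names with capsh (from the libcap tools):
# capsh --decode=0000003fffffffff
```

Run this inside the container before and after the `setpriv` invocation above to see the masks shrink.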
What Docker Adds on Top
Everything Docker adds is valuable — but it is above the container, not inside it:
- Images — tarball layers stored in a local cache, assembled into a rootfs via overlayfs.
- Registries — pull layers by digest from a central server.
- A daemon and CLI — automate all the above steps reliably, expose an API.
- Networking plugins — abstractions over the veth/bridge/NAT/port-forward dance.
- Image building — Dockerfile → series of commits to a writable layer.
- Integration with orchestrators — CRI for Kubernetes, compose for local, swarm for simple clustering.
Every one of those sits above the core container. If you understand the core, you understand what each Docker feature is automating.
Clean Up
# Exit the container shell
exit
# Remove the cgroup (must be empty first — no processes, no child cgroups)
sudo rmdir /sys/fs/cgroup/mycontainer
# Remove the network namespace if you created a symlink for it
sudo rm -f /var/run/netns/mycontainer
# Remove NAT rule
sudo iptables -t nat -D POSTROUTING -s 10.200.0.0/24 ! -o veth0 -j MASQUERADE
# Tear down veth (goes away when the namespace dies, but clean if still present)
sudo ip link del veth0 2>/dev/null || true
# Rootfs dir — remove when you no longer need it
rm -rf ~/containerlab
Why This Matters
You will probably never build a container by hand in production. But having done it once:
- Debugging container networking is now "enter the namespace with nsenter, inspect with ip/ss/tcpdump." Not magic.
- Understanding OOM-killed pods is now "read `memory.current`, `memory.max`, `memory.events` in the pod's cgroup." Not mystery.
- Reasoning about security is now "which caps are dropped, which namespaces are shared with the host, what seccomp filter is applied." Not incantations.
- Choosing runtimes (containerd vs runc vs crun vs gVisor) becomes "they all invoke these primitives; they differ in how much extra isolation they layer on top." Not brand loyalty.
When a container misbehaves, your mental model is "what is the host-side view of the process and its cgroup/namespaces?" — and every question collapses into things you can answer with ps, cat /proc/..., cat /sys/fs/cgroup/..., and nsenter.
Key Concepts Summary
- A container is a process + namespaces + cgroup + rootfs + dropped privileges. Nothing more fundamental.
- `unshare` creates namespaces, `pivot_root` switches the rootfs, cgroup directories impose resource limits. All tools in base Linux packages.
- `pivot_root` is the correct way to switch roots — chroot works but is less rigorous about hiding the old root.
- Networking requires a veth pair, routing, and usually NAT to give a container outbound connectivity. This is what every container runtime automates.
- cgroup limits apply to the process and its children as long as they stay in the cgroup.
- Host-side tools work on container processes via `nsenter`, `/proc/$PID/*`, and `/sys/fs/cgroup/...`.
- Docker and Kubernetes are automation over these primitives, not alternatives to them.
- Privilege dropping (capabilities, seccomp, LSMs, user namespaces) is the security story, layered on top of the isolation namespaces provide.
Common Mistakes
- Skipping the `--fork` flag to `unshare` with `--pid`. Without it, unshare execs the shell in the old PID namespace; the shell's first child becomes PID 1 of the new one, and when that child exits the namespace dies and further forks fail.
- Forgetting to mount `/proc` inside the new namespace. Tools that read `/proc` will show the host's view, defeating the point.
- Using `chroot` instead of `pivot_root`. chroot can still leak the old root; pivot_root is the safer choice.
- Forgetting to enable controllers in the parent's `cgroup.subtree_control`. A child only gets `cpu.max`, `memory.max`, and friends once the parent has enabled those controllers for its subtree.
- Deleting a cgroup that still has processes in it. The `rmdir` fails with EBUSY; move the processes out first.
- Creating a network namespace and wondering why DNS does not work. The container reads `/etc/resolv.conf` from its mount namespace — which, if the rootfs does not contain it, is empty. Copy resolv.conf in or bind-mount one.
- Thinking a privileged rootless "container" is secure because it uses namespaces. Without seccomp and capability drops, it is not.
- Confusing `mount --bind` with `mount -o bind` on the command line. Same thing on modern util-linux, but older scripts sometimes mix them — always prefer `mount --bind`.
- Trying to `docker exec` into the hand-built container. Docker cannot see it — you made it with your own tools. Use `nsenter -t $PID -a -- sh` to "exec" in.
You have built a container by hand using `unshare --pid --fork --mount --uts --net`. Inside, `ps` correctly shows your shell as PID 1. You run a long-running program in the container. From the host, you can see the program has real PID 18342. A colleague signs into the host and runs `kill 18342`. What happens inside the container?