Linux Fundamentals for Engineers

Namespaces Explained

An engineer asks: "What is a container, actually?" The usual answer, "a lightweight VM," is wrong. A container is not a virtualized machine. It is a normal Linux process that the kernel has lied to. The process sees a filesystem that is not the host's filesystem. It sees a network interface that does not exist outside of it. It sees itself as PID 1, while the host knows it by an entirely different PID. It sees its own hostname, its own IPC, its own users. None of it is emulated: the kernel simply shows this process a custom view of reality.

The mechanism that enables those lies is namespaces: eight of them, each isolating one kind of kernel resource (older material often says seven, because the time namespace only arrived in Linux 5.6). Once you understand namespaces as a list of things the kernel can lie about, containers stop being magic; they are a kernel feature you could invoke by hand from a shell. This lesson is that list: what each namespace isolates, how to see them on a running system, and how to build your own.


What a Namespace Is

A namespace is a kernel-level abstraction that isolates some class of global resource, so that the processes inside the namespace see a different instance of that resource than processes outside it. The kernel manages many global things: the set of PIDs, the set of mounts, the hostname, the routing table. A namespace says "make a separate set of these, visible only to members of this namespace."

Three operations define the interface:

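  • clone() or unshare() with one or more CLONE_NEW* flags creates new namespaces. clone() places the new child process in them; unshare() moves the calling process (for PID and time namespaces, only its future children).
  • setns() moves the calling thread into an existing namespace, identified by a file descriptor such as one opened on /proc/[pid]/ns/<kind>.
  • Bind-mounting /proc/[pid]/ns/<kind> onto a file pins a namespace so it outlives its last process (the core trick ip netns add uses).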
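The util-linux commands used throughout this lesson map onto these three operations: unshare(1) calls unshare(2), nsenter(1) opens a namespace file and calls setns(2), and unshare with a file argument pins the new namespace with a bind mount. A minimal sketch, assuming root and a util-linux recent enough to accept --uts=FILE; the function name pin_uts_demo, the path /run/uts-demo, and the hostname pinned-demo are hypothetical.

```shell
# Sketch: the three namespace operations from the command line (run as root).
# pin_uts_demo, /run/uts-demo and "pinned-demo" are hypothetical names.
pin_uts_demo() {
  local pin=/run/uts-demo
  touch "$pin"                               # the bind-mount target must exist
  unshare --uts="$pin" hostname pinned-demo  # unshare(2), then bind-mount the new uts namespace onto $pin
  nsenter --uts="$pin" hostname              # setns(2) via the pinned file: prints pinned-demo
  hostname                                   # the host's own hostname is untouched
  umount "$pin" && rm -f "$pin"              # unpin: with no process left inside, the namespace is freed
}
```

The namespace survives between the unshare and nsenter calls only because of the bind mount; no process is running inside it at that point.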

Every process belongs to one namespace per kind. Linux has eight kinds:

Namespace   CLONE_NEW* flag    What it isolates
mnt         CLONE_NEWNS        The mount table: what is mounted where
pid         CLONE_NEWPID       The PID number space: who is PID 1
net         CLONE_NEWNET       Network interfaces, routing tables, firewall rules, sockets
ipc         CLONE_NEWIPC       System V IPC, POSIX message queues
uts         CLONE_NEWUTS       Hostname and NIS domain name
user        CLONE_NEWUSER      User and group IDs, capabilities
cgroup      CLONE_NEWCGROUP    The view into /sys/fs/cgroup
time        CLONE_NEWTIME      The monotonic and boot clocks (newer, Linux 5.6+; available but rarely used)

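To check which of these kinds the running kernel actually provides, list the entries under /proc/<pid>/ns. A small bash sketch; ns_kinds is a hypothetical helper name. On newer kernels the listing also includes pid_for_children and time_for_children, which name the namespace that children of the process will be created in (these can differ from the process's own after an unshare).

```shell
# Print the namespace kinds the running kernel exposes for this shell, one per line.
ns_kinds() {
  local link
  for link in /proc/"$$"/ns/*; do
    printf '%s\n' "${link##*/}"   # strip the directory part, keep the kind name
  done
}
```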
KEY CONCEPT

A container is a process (and its children) that belongs to a set of namespaces different from the host's. That is the entire definition. Everything else a container runtime does — pulling images, setting up overlay mounts, configuring networking, applying seccomp filters — is plumbing around the core idea: clone a process with new namespaces, then make sure the inside looks the way you want.


Seeing Namespaces on a Live System

Every process has a /proc/[pid]/ns/ directory with a symlink per namespace kind. The symlink's target contains a unique inode — two processes with the same inode are in the same namespace.

# Your shell's namespaces
ls -l /proc/$$/ns
# lrwxrwxrwx 1 admin admin 0 Apr 19 10:00 cgroup -> 'cgroup:[4026531835]'
# lrwxrwxrwx 1 admin admin 0 Apr 19 10:00 ipc    -> 'ipc:[4026531839]'
# lrwxrwxrwx 1 admin admin 0 Apr 19 10:00 mnt    -> 'mnt:[4026531840]'
# lrwxrwxrwx 1 admin admin 0 Apr 19 10:00 net    -> 'net:[4026531992]'
# lrwxrwxrwx 1 admin admin 0 Apr 19 10:00 pid    -> 'pid:[4026531836]'
# lrwxrwxrwx 1 admin admin 0 Apr 19 10:00 time   -> 'time:[4026531834]'
# lrwxrwxrwx 1 admin admin 0 Apr 19 10:00 user   -> 'user:[4026531837]'
# lrwxrwxrwx 1 admin admin 0 Apr 19 10:00 uts    -> 'uts:[4026531838]'

# Compare with a containerized process — find a container PID on the host
PID=$(pgrep -f nginx | head -1)           # pick any running container process
sudo ls -l /proc/$PID/ns                  # root: another user's namespace links are not readable
# Some inodes differ — those are the namespaces the container runtime created

# Group processes by what namespace they live in
lsns                         # list every namespace on the system
# NS         TYPE   NPROCS  PID USER   COMMAND
# 4026531836 pid        245    1 root   /lib/systemd/systemd
# 4026532420 pid          8 12345 65535 nginx: master
# 4026532421 net          8 12345 65535 nginx: master
# 4026532422 mnt          8 12345 65535 nginx: master
# ...

# Just one kind of namespace
lsns --type net

If two processes have the same inode for net:, they share a network namespace (same interfaces, same routing table). The host and a container almost always differ on mnt, pid, net, and uts; they may share ipc and user depending on the runtime.
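The inode comparison is easy to script. A minimal sketch (same_ns is a hypothetical helper name) that reads both symlinks and succeeds only when the targets match; reading another user's /proc/<pid>/ns entries is refused without root, so run it with sudo for container processes.

```shell
# Exit 0 if processes $2 and $3 share the namespace of kind $1 (net, mnt, pid, ...),
# 1 if they differ, 2 if either link cannot be read.
same_ns() {
  local a b
  a=$(readlink "/proc/$2/ns/$1") || return 2
  b=$(readlink "/proc/$3/ns/$1") || return 2
  [ "$a" = "$b" ]
}
```

For example, in a root shell, `if same_ns net 1 "$PID"; then echo "shares the host network"; fi`.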


The Eight Namespaces in Detail

mnt — Mount Namespace

Isolates the mount table. A process in its own mnt namespace sees a different set of mounts from the host — the same kernel, a different view of the filesystem tree.

This is what lets a container have / be its overlay rootfs while the host has / on nvme0n1p2. The container's mounts do not appear on the host, and the host's mounts do not appear in the container (unless the runtime explicitly bind-mounts them in).

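# Inside the container's mount namespace ($CPID is the host-side PID of a container process).
# Add -p: the container's /proc only resolves /proc/self for processes in its PID namespace.
sudo nsenter -t $CPID -m -p -- cat /proc/self/mountinfo | head -5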
# Different from the host's /proc/self/mountinfo

# Or create one by hand right now
sudo unshare -m bash
# Inside: every mount you make is invisible to the host
mount -t tmpfs tmpfs /tmp
findmnt /tmp
exit
# Back in the host: /tmp is still its original mount
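You can also compare mount tables from the host without entering anything: /proc/<pid>/mountinfo lists the mount table of that process's mount namespace, with the mount point (relative to the process's root directory) in the fifth field. A sketch; mount_points is a hypothetical helper, and $CPID is again assumed to hold the host-side PID of a container process.

```shell
# Print the mount points process $1 sees (default: the caller), one per line.
# Field 5 of mountinfo is the mount point; spaces in paths appear as \040.
mount_points() {
  awk '{ print $5 }' "/proc/${1:-self}/mountinfo"
}
# In a root shell: mounts that differ between the host and a container
#   diff <(mount_points 1) <(mount_points "$CPID")
```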

pid — PID Namespace

Isolates the PID number space. The first process in a new PID namespace gets PID 1. It is also the namespace's init — if it dies, every process in the namespace dies with it.

# Create a new PID namespace — the shell inside is PID 1
sudo unshare --pid --fork --mount-proc bash
# Inside:
ps aux
# USER   PID ... COMMAND
# root     1 ... bash        <- our shell is PID 1
# root     8 ... ps
# nothing from the host is visible
# Exit the shell and the namespace is cleaned up
exit
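The --fork flag is not optional. unshare(2) with CLONE_NEWPID does not move the calling process; only the next child it creates lands in the new namespace, as its PID 1. A sketch assuming root (pid_ns_demo is a hypothetical name) that shows the difference with a one-shot shell printing its own PID.

```shell
# Why unshare --pid needs --fork (run as root).
pid_ns_demo() {
  # The forked child is the first process in the new namespace: prints "with --fork: 1".
  unshare --pid --fork --mount-proc sh -c 'echo "with --fork: $$"'
  # Without --fork, sh replaces unshare in the old namespace, so it prints a host PID.
  unshare --pid sh -c 'echo "without --fork: $$"'
}
```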

Two important properties:

  • PIDs in a child namespace are different numbers from the host. The same process has a different PID depending on which namespace is asking.
  • The host can still see every process; it just sees them by their host-side PIDs. The NSpid: field of /proc/$HOSTPID/status lists the PID the process has in each PID namespace it belongs to, outermost first.
PRO TIP

grep NSpid /proc/$PID/status tells you the PID a process has in its own PID namespace. On the host, the container's "PID 1" process might be PID 18234; inside the container, the same process sees itself as PID 1. The line NSpid: 18234 1 captures both views.
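Pulling the two ends out of NSpid takes one awk call each. A sketch with hypothetical helper names: the first NSpid entry is the PID in the namespace this /proc was mounted from (the host's, when run on the host), and the last is the PID the process sees for itself.

```shell
# outer_pid: PID as seen from the PID namespace of this /proc mount (first NSpid entry).
# inner_pid: PID as seen inside the process's own PID namespace (last NSpid entry).
outer_pid() { awk '/^NSpid:/ { print $2 }'  "/proc/${1:-self}/status"; }
inner_pid() { awk '/^NSpid:/ { print $NF }' "/proc/${1:-self}/status"; }
```

For the process in the tip above, whose status shows NSpid: 18234 1, inner_pid 18234 would print 1.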

net — Network Namespace

Isolates networking: interfaces, routing tables, ARP tables, firewall rules (iptables/nftables), sockets. Inside a fresh net namespace you see only the loopback interface (lo), and it starts out down. Any other connectivity you add yourself, typically by creating a veth pair on the host and moving one end into the namespace.

# Create a network namespace with the iproute2 tool
sudo ip netns add demo
sudo ip netns list
# demo

# Run a command in it
sudo ip netns exec demo ip link
# 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN ...
# only lo — no eth0, no wireless, nothing

# Bring lo up inside it
sudo ip netns exec demo ip link set lo up
sudo ip netns exec demo ip addr
# 1: lo: <LOOPBACK,UP> ... inet 127.0.0.1/8 ...

# Clean up
sudo ip netns del demo

All container networking (Docker bridges, Kubernetes CNI plugins, Istio sidecars) is built on manipulations of network namespaces: create the namespace, add a virtual interface (veth) with one end inside and one end on the host, set up routing, apply iptables rules, done.
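Those steps map onto a handful of iproute2 commands. A minimal sketch, assuming root; the namespace name demo, the interface names veth-host and veth-demo, the 10.200.0.0/24 subnet, and the function names are all hypothetical choices, so pick ones that do not collide with your network. Reaching beyond the host would additionally need a default route inside the namespace plus forwarding and NAT on the host, roughly what Docker's default bridge networking automates.

```shell
# Connect a new network namespace to the host with a veth pair (run as root).
veth_demo_up() {
  ip netns add demo
  ip link add veth-host type veth peer name veth-demo   # two linked interfaces, both on the host for now
  ip link set veth-demo netns demo                       # move one end into the namespace
  ip addr add 10.200.0.1/24 dev veth-host
  ip link set veth-host up
  ip netns exec demo ip link set lo up
  ip netns exec demo ip addr add 10.200.0.2/24 dev veth-demo
  ip netns exec demo ip link set veth-demo up
  ping -c 1 10.200.0.2                                   # host -> namespace, across the pair
}

veth_demo_down() {
  # Once nothing holds the namespace, its veth end is destroyed, and a veth
  # interface disappears with its peer; the explicit delete covers any leftover.
  ip netns del demo
  ip link del veth-host 2>/dev/null || true
}
```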

ipc — IPC Namespace

Isolates System V IPC (shared memory segments, message queues, semaphores) and POSIX message queues. Processes in different IPC namespaces cannot see each other's IPC objects.

This matters for legacy software that uses shmget() / msgget() / semget() — multiple instances in separate containers will not collide. For modern code, which mostly uses Unix sockets or shared memory via shm_open()/mmap(), this namespace rarely affects you.

ipcs                              # see IPC objects in your namespace
# ------ Message Queues --------
# ...
# ------ Shared Memory Segments --------
# ...
# ------ Semaphore Arrays --------
# ...

sudo unshare --ipc bash           # new IPC namespace
ipcs                              # completely empty: the host's IPC objects are not visible
exit
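To see the isolation with a real object, create a System V shared memory segment with ipcmk and then list segments from a fresh IPC namespace. A sketch assuming root (for unshare) and util-linux's ipcmk, which prints the new segment's id as the last word of its output; ipc_ns_demo is a hypothetical name.

```shell
# Show that a SysV shared memory segment is invisible from a new IPC namespace (run as root).
ipc_ns_demo() {
  local id
  id=$(ipcmk -M 4096 | awk '{ print $NF }')   # ipcmk prints "Shared memory id: <id>"
  ipcs -m -i "$id"                            # details of the segment in the current namespace
  unshare --ipc ipcs -m                       # new namespace: section header only, no segments
  ipcrm -m "$id"                              # clean up in the original namespace
}
```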

uts — UTS Namespace

Isolates hostname and NIS domain. Lets a container sethostname("web-01") without affecting the host.

hostname
# host.example.com

sudo unshare --uts bash
hostname container-1
hostname
# container-1         <- changed inside the namespace
exit

hostname
# host.example.com    <- host unaffected

It is a small namespace, but every container runtime uses it: when hostname inside a Kubernetes pod prints the pod name (or whatever name the runtime chose), that is the UTS namespace at work.

user — User Namespace

The most powerful and dangerous one. Isolates user IDs and group IDs — a process can be UID 0 (root) inside the namespace while being a regular unprivileged user on the host. Combined with capabilities, this is how rootless containers work.

# Create a user namespace and map your host UID 1000 to UID 0 inside
unshare --user --map-root-user bash
# Inside:
id
# uid=0(root) gid=0(root) groups=0(root),...

cat /proc/self/uid_map
# 0       1000     1                <- inside UID 0 = host UID 1000

# But real privileges are still limited: host-owned files stay off-limits
touch /etc/hosts
# touch: cannot touch '/etc/hosts': Permission denied
exit

User namespaces are how rootless Docker and Podman let non-root host users run containers with "root" inside, and how Kubernetes' user-namespace support maps a pod's root to an unprivileged host UID. It is a big deal for security: a container escape as "root-in-namespace" still lands on the host as an unprivileged UID (1000 in the example above), limiting the blast radius.
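Each line of uid_map reads: first UID inside, first UID outside, number of IDs in the range. Translating an inside UID to a host UID is a matter of finding the range it falls in. A sketch; host_uid is a hypothetical helper that reads /proc/self/uid_map unless given another map file.

```shell
# Translate UID $1 inside a user namespace to the matching host UID, using
# uid_map lines of the form: <inside-start> <outside-start> <count>.
# $2: map file (default /proc/self/uid_map). Exits 1 if $1 is unmapped.
host_uid() {
  awk -v uid="$1" '
    BEGIN { uid += 0 }
    uid >= $1 && uid < $1 + $3 { print $2 + (uid - $1); found = 1; exit }
    END { exit !found }
  ' "${2:-/proc/self/uid_map}"
}
```

Inside the unshare --user --map-root-user shell above, host_uid 0 should print 1000; on the host, whose map is the identity, host_uid N prints N.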

cgroup — Cgroup Namespace

Isolates the view into /sys/fs/cgroup. Processes in a cgroup namespace see their cgroup hierarchy rooted at their own cgroup, not the host's.

Before cgroup namespaces existed, a container could cat /proc/self/cgroup and see /docker/abc123... — exposing the container runtime's paths. With cgroup namespaces, it sees / — a clean, rooted view.

cat /proc/self/cgroup
# 0::/user.slice/user-1000.slice/session-3.scope

# Inside a container (with a cgroup namespace):
# cat /proc/self/cgroup
# 0::/                <- rooted at the container's own cgroup
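On cgroup v2 the unified entry is the line starting 0::, so the path is everything after that prefix. A sketch (cg_path is a hypothetical helper) that takes a cgroup file; run from the host against /proc/<pid>/cgroup, the result appended to /sys/fs/cgroup is that process's cgroup directory, assuming cgroup v2 is mounted there.

```shell
# Print the cgroup v2 path from a /proc/<pid>/cgroup file (default: the caller's).
# Prints nothing on a cgroup v1-only system, which has no "0::" line.
cg_path() {
  sed -n 's/^0:://p' "${1:-/proc/self/cgroup}"
}
# Example (host, cgroup v2): cat "/sys/fs/cgroup$(cg_path /proc/$PID/cgroup)/memory.current"
```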

time — Time Namespace

Newer (Linux 5.6+). Isolates CLOCK_MONOTONIC and CLOCK_BOOTTIME — the clocks that count forward from boot. Lets checkpoint/restore tools preserve monotonic time when a process moves between hosts.

Rarely used directly. Docker and most container runtimes do not use it today.


How Containers Assemble Them

A Docker container typically creates new namespaces for:

  • mnt — so the overlay rootfs is the container's /.
  • pid — so the container has its own PID 1.
  • net — so the container has its own interfaces (unless --network=host).
  • uts — so the container can set its own hostname.
  • ipc — for isolation of IPC objects.
  • Sometimes user — for rootless mode or explicit user namespace mapping.
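  • cgroup: a private namespace by default on cgroup v2 hosts, for a clean cgroup view (on cgroup v1 hosts Docker shares the host's cgroup namespace by default).
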
# See what namespaces a specific container uses
docker run -d --name demo alpine sleep 1000
PID=$(docker inspect --format='{{.State.Pid}}' demo)
sudo ls -l /proc/$PID/ns

# Compare to your shell (root is needed to read another user's namespace links)
diff <(ls -l /proc/$$/ns | awk '{print $9, $11}') \
     <(sudo ls -l /proc/$PID/ns | awk '{print $9, $11}')

docker rm -f demo

The differences you will see are exactly the namespaces Docker chose to create a new instance of.
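Parsing ls -l output works, but readlink gives the link targets directly. A sketch of the same comparison as a function (ns_diff is a hypothetical name) that prints only the kinds that differ; run it as root, since reading another user's namespace links is refused otherwise.

```shell
# Print each namespace kind in which process $2 differs from process $1.
ns_diff() {
  local link kind
  for link in /proc/"$1"/ns/*; do
    kind=${link##*/}
    [ "$(readlink "$link")" = "$(readlink "/proc/$2/ns/$kind")" ] || printf '%s\n' "$kind"
  done
}
# In a root shell, while a container named demo is running:
#   ns_diff 1 "$(docker inspect --format '{{.State.Pid}}' demo)"
```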


Tools That Work With Namespaces

unshare — run a command in new namespaces

# New UTS, PID, mount, and IPC namespaces (network and user stay shared with the host)
sudo unshare --uts --pid --fork --mount --mount-proc --ipc bash
# Now you are effectively in a mini-container:
hostname sandbox
ps aux
mount -t tmpfs tmpfs /mnt
exit
# Everything dissolves when the shell exits

nsenter — enter an existing namespace

The essential tool for debugging containers from the host.

# Enter all namespaces of a container's PID
sudo nsenter -t $PID -a

# Enter just the network namespace (super handy for debugging container networking)
sudo nsenter -t $PID -n ip addr
sudo nsenter -t $PID -n ss -tlnp

# Enter mount namespace to see what the container sees in the FS
sudo nsenter -t $PID -m ls /etc
PRO TIP

sudo nsenter -t $CPID -n ip addr is the 30-second debug for "why can't the container reach X?" It drops you into the container's network namespace while keeping the host's filesystem, so host-side tools like ip, ss, tcpdump, and ping work even though none of them are installed in the container image. Learning nsenter removes most of the pain of debugging minimal container images.
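Wrapping the PID lookup and nsenter together saves a step when you debug often. A sketch assuming root, bash, and the docker CLI; cnet is a hypothetical name. Because only the network namespace is entered, the command still comes from the host's filesystem.

```shell
# Run a host-side command inside a Docker container's network namespace (as root).
# Usage: cnet <container> <command> [args...]   e.g.  cnet web ss -tlnp
cnet() {
  local pid
  pid=$(docker inspect --format '{{.State.Pid}}' "$1") || return
  nsenter -t "$pid" -n -- "${@:2}"
}
```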

ip netns — dedicated to network namespaces

sudo ip netns add lab
sudo ip netns exec lab ip addr
sudo ip netns exec lab bash          # drop into a shell in the namespace
sudo ip netns del lab

lsns — list every namespace

lsns                                 # all of them
lsns --type net                      # just network namespaces
lsns -p 1                            # what namespaces does PID 1 belong to

Namespaces Are Not Security

Namespaces isolate resource views. They do not by themselves sandbox a process. A process with CAP_SYS_ADMIN inside a mount namespace can mount arbitrary things. A process with CAP_NET_ADMIN in a network namespace can manipulate routes, and if it can see the host network (shared namespace), it can manipulate the host's. A process that can see /proc or /sys paths outside its namespace can read or write them.

Containers are secure because namespaces are combined with:

  • Capabilities — dropping CAP_SYS_ADMIN, CAP_NET_ADMIN, etc. from the container's process.
  • seccomp — a BPF filter that blocks specific syscalls (like mount, reboot, kexec_load).
  • LSMs — AppArmor or SELinux policies that restrict what the process can touch.
  • cgroups — resource limits, preventing one container from starving others.
  • User namespaces — root-in-namespace mapped to unprivileged on the host.
  • Read-only rootfs / no-new-privileges / dropped SUID — belt-and-suspenders.

A container escape is not a "namespace escape" — it is usually finding a syscall or kernel bug that bypasses one of these layers. Namespaces are the enabling primitive, not the security story.
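Whether the capability layer is in place is visible from the host. The CapEff line of /proc/<pid>/status is a hex bitmask of the process's effective capabilities, with bit N set for capability number N from linux/capability.h (CAP_NET_ADMIN is 12, CAP_SYS_ADMIN is 21). A bash sketch with hypothetical helper names; where libcap's capsh is installed, capsh --decode does the full translation.

```shell
# Succeed if capability number $2 is set in hex capability mask $1.
cap_in_mask() {
  (( (16#$1 >> $2) & 1 ))
}

# Succeed if process $1 holds capability $2 in its effective set; 2 if unreadable.
has_cap() {
  local mask
  mask=$(awk '/^CapEff:/ { print $2 }' "/proc/$1/status")
  [ -n "$mask" ] || return 2
  cap_in_mask "$mask" "$2"
}
# Example: has_cap "$PID" 21 && echo "holds CAP_SYS_ADMIN"
```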

WARNING

Running a container with --privileged disables most of the extra layers: full capability set, no seccomp, writable /sys and /dev, host devices available. The namespaces are still there — so it still "looks" isolated — but a privileged container can trivially escape to the host. Avoid --privileged unless you genuinely need it, and when you do, treat the container's security as identical to running the process directly on the host.


Debugging With Namespaces

A few techniques you will reach for:

# Is this process in the same netns as the host? (Compare the inodes.)
sudo readlink /proc/1/ns/net /proc/$PID/ns/net
# net:[4026531992]            <- PID 1, the host's namespace
# net:[4026532421]            <- $PID: a different inode, so an isolated namespace

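# See every container's network connections from the host.
# The shims themselves run in the host's namespaces, so enter their children,
# which are the containers' init processes.
for shim in $(pgrep -f containerd-shim); do
  for pid in $(pgrep -P "$shim"); do
    echo "=== PID $pid ==="
    sudo nsenter -t "$pid" -n ss -tanp 2>/dev/null
  done
done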

# Which container is a mystery process in? Its cgroup path names the container.
cat /proc/$PID/cgroup
# 0::/system.slice/docker-abc123deadbeef.scope

# Compare two processes' namespace memberships
diff <(sudo ls -l /proc/$PID1/ns/ | awk '{print $9, $11}') \
     <(sudo ls -l /proc/$PID2/ns/ | awk '{print $9, $11}')
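With the systemd cgroup driver on cgroup v2, Docker names each container's scope docker-<container id>.scope, so the ID can be pulled straight out of the cgroup file. A sketch under that assumption (the cgroupfs driver uses a different layout, such as /docker/<id>); docker_id_of is a hypothetical helper.

```shell
# Print the Docker container ID from a /proc/<pid>/cgroup file, assuming the
# systemd driver's "docker-<id>.scope" naming. Prints nothing if there is no match.
docker_id_of() {
  sed -n 's/^0::.*docker-\([0-9a-f]*\)\.scope$/\1/p' "${1:-/proc/self/cgroup}"
}
# Example: docker inspect --format '{{.Name}}' "$(docker_id_of /proc/$PID/cgroup)"
```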

Key Concepts Summary

  • Namespaces isolate kernel resources. There are eight kinds: mount, PID, network, IPC, UTS, user, cgroup, and the newer time namespace.
  • Processes belong to one namespace per kind. /proc/[pid]/ns/<kind> is a symlink with a unique inode — same inode = same namespace.
  • Containers are processes in custom namespace sets. No virtualization — just the kernel lying consistently.
  • unshare creates; nsenter enters; ip netns manages network namespaces. Learn all three.
  • mnt isolates the mount table; net isolates interfaces and routing; pid isolates the PID numbering; uts isolates the hostname. The others (ipc, user, cgroup, time) matter in specific situations.
  • User namespaces map UIDs. Root inside can be unprivileged outside — the basis of rootless containers.
  • Namespaces are not security alone. They isolate views. Security comes from combining them with capabilities, seccomp, LSMs, cgroups, and rootfs hardening.
  • lsns lists every namespace; NSpid in /proc/[pid]/status shows nested PIDs. Core inspection tools.

Common Mistakes

  • Treating namespaces as virtualization. They are view isolation: the kernel is still shared, and a kernel bug reachable from inside one namespace affects every process on the host.
  • Assuming the PID inside a container is the same as the host-side PID. They are almost always different — use NSpid: in /proc/[pid]/status or docker inspect to translate.
  • Debugging container networking from the host without nsenter. You look at the wrong interfaces and get the wrong answer.
  • Running --privileged containers in production. It bypasses the layered security that makes namespaces useful and defeats the whole point.
  • Confusing user namespaces with UIDs. A user namespace is a mapping — a process can be root inside while being unprivileged outside, and that is a feature, not a hack.
  • Expecting DNS settings to live in the network namespace. A network namespace holds interfaces, routes, and firewall rules (nothing but lo until you add interfaces); the resolver configuration is /etc/resolv.conf, which the process sees through its mount namespace.
  • Expecting ip netns del to tear down a namespace that still has processes in it. It only removes the name (the bind mount under /run/netns); the namespace and its interfaces live on as long as any process or open file descriptor holds them, but ip netns exec can no longer reach it. Check ip netns pids <name> and stop those processes first.
  • Believing a mount namespace isolates storage. It isolates the mount table, not the disks underneath: two containers on the same host still compete for page cache, I/O bandwidth, and disk space, and share files outright if their mounts point at the same storage. Namespaces isolate what you see, not what you share.

KNOWLEDGE CHECK

You have two running processes on a host. `readlink /proc/123/ns/net` returns `net:[4026532421]`, and `readlink /proc/456/ns/net` returns `net:[4026531992]`. What does that tell you, and what practical consequence does it have?