Docker & Container Fundamentals

The Root Problem

A security team auditing a Kubernetes cluster found that 87% of pods were running as UID 0. Half of those had CAP_SYS_ADMIN, a quarter had access to /var/run/docker.sock via hostPath, and several had --privileged. The developers' defense: "Containers are isolated, so root inside is fine." The auditor wrote back with three CVE numbers from the past year where that reasoning would have led to full host compromise. The fix was small — add USER to Dockerfiles, set runAsNonRoot: true in pod specs — but the team had to audit every image and every manifest. The root-by-default habit had gone on for years.

Running as root in a container is not safe just because the container is namespaced. The kernel is shared. Bugs in the kernel, in the container runtime, or in applications with excessive capabilities have all produced real container escapes. The right baseline — which costs almost nothing — is to never run as root unless your workload specifically needs it. This lesson explains why, how the common failure modes work, and the specific Dockerfile + runtime patterns that get you to "no root, ever, by default."


Why Root Inside a Container Is Dangerous

The container isolation story in one sentence: the kernel tries to prevent a process from seeing or affecting things outside its namespaces and cgroups. The weakness: it is the same kernel, and every defense is a software check that a sufficiently-capable process in the container can try to bypass.

Dangers of root inside a container:

1. Kernel vulnerabilities

A kernel CVE that allows privilege escalation from user mode often translates directly to container escape. Dirty Pipe (CVE-2022-0847), CVE-2022-0185 (a heap overflow in filesystem-context parsing), CVE-2022-0492 (the cgroup v1 release_agent bug), CVE-2022-4969, and on the runtime side runc's CVE-2019-5736: every year brings new ones. Root in the container has more paths to exploit these than a regular user would.

If the container's process is non-root, many of these exploits require an additional privilege-escalation step. Making attackers work harder is not "secure"; it is defense-in-depth.

2. Excessive Linux capabilities

Docker drops most caps by default, but "the defaults" still include plenty that matter. Root-in-container with Docker's default caps includes:

  • CAP_CHOWN — change file ownership (on bind-mounted host paths, this affects host files)
  • CAP_DAC_OVERRIDE — bypass file permission checks
  • CAP_NET_BIND_SERVICE — bind to port < 1024
  • CAP_SETUID / CAP_SETGID — change user IDs
  • CAP_FOWNER — bypass owner checks
  • CAP_KILL — send signals to other processes
  • CAP_AUDIT_WRITE, CAP_NET_RAW, and a few others

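A non-root process (UID > 0) does not hold these capabilities under default Docker: the kernel clears the effective and permitted sets when the container's process starts with a non-zero UID. They remain in the bounding set, so a setuid-root binary or a file capability can still regain them unless no_new_privs is set. Non-root also lacks the "UID 0 bypass" that some kernel paths check. Together these are a meaningful additional layer.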
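
One way to see which of the capabilities above a process actually holds is to decode the CapEff bitmask in /proc/<pid>/status. The sketch below is a minimal decoder in POSIX shell. The function names has_cap and report_caps are illustrative, the capability numbers follow linux/capability.h, and only the capabilities named in this lesson are covered.

```shell
# has_cap MASK BIT: succeed if capability number BIT is set in hex MASK
# (MASK is written the way CapEff appears in /proc/<pid>/status).
has_cap() {
  [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# report_caps MASK: print the names of the capabilities set in MASK,
# limited to the ones discussed in this lesson.
report_caps() {
  for entry in 0:CAP_CHOWN 1:CAP_DAC_OVERRIDE 3:CAP_FOWNER 5:CAP_KILL \
               6:CAP_SETGID 7:CAP_SETUID 10:CAP_NET_BIND_SERVICE \
               13:CAP_NET_RAW 21:CAP_SYS_ADMIN 29:CAP_AUDIT_WRITE; do
    bit=${entry%%:*}
    name=${entry#*:}
    if has_cap "$1" "$bit"; then
      echo "$name"
    fi
  done
}
```

Inside a container, report_caps "$(awk '/^CapEff:/ {print $2}' /proc/self/status)" prints what the current process holds. Running it once in a root container and once with --user 10001 makes the difference visible.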

3. Bind mounts + chown = host file modification

# As root in container, with a bind mount from /etc
docker run -it --rm -v /etc:/hostetc ubuntu bash
# Inside:
chown 1000:1000 /hostetc/passwd     # host /etc/passwd is now owned by UID 1000 on the host!

When you bind-mount a host path into a root-in-container, CAP_CHOWN lets you modify ownership on the host. Compromise one container, compromise the shared mount. Non-root containers cannot do this (without CAP_CHOWN).

4. Writable docker.sock

A container with /var/run/docker.sock bind-mounted can spawn new containers with any settings — including --privileged and host path mounts. A non-root user in the container cannot fix this; the socket is the escalation path. But combined with root, it is game over.

5. Images that run as root by default

Most popular base images (ubuntu, debian, alpine, python, node, nginx) default to root. If you do not explicitly set USER in your Dockerfile, your container runs as root. One missing line = root in production.

KEY CONCEPT

Root in a container has more attack surface than non-root, more Linux capabilities by default, and the ability to write to bind-mounted host paths. The mitigations — USER directive in Dockerfiles, runAsNonRoot: true in Kubernetes — are almost free. Not using them is leaving value on the table for nothing.


The USER Directive: The Simple Fix

The baseline pattern:

FROM node:20-slim

# Create an app user and group
RUN groupadd --system --gid 10001 app && \
    useradd --system --uid 10001 --gid app --no-create-home --shell /usr/sbin/nologin app

WORKDIR /app
COPY --chown=app:app . .
RUN npm ci --omit=dev

USER app

CMD ["node", "server.js"]

Key moves:

  • useradd creates a system user with UID over 10000. Using UIDs > 10000 avoids conflicts with host users on bind mounts.
  • --no-create-home and --shell /usr/sbin/nologin — the user has no login shell and no home directory.
  • COPY --chown=app:app sets ownership during copy (no extra layer from a post-copy chown).
  • USER app before CMD — everything after runs as this user.

Verify:

docker build -t myapp .
docker run --rm myapp id
# uid=10001(app) gid=10001(app) groups=10001(app)

Some images have a built-in user

Many modern base images provide a non-root user out of the box:

  • node:20 / node:20-slim → node user (UID 1000)
  • nginx:1.25-alpine (recent) → nginx user, though the default is still root (gotcha)
  • gcr.io/distroless/nodejs20-debian12:nonroot → explicit nonroot user (UID 65532)
  • python:3.11-slim → no default user; must add one

FROM node:20-slim
# Use the image's built-in user (Dockerfiles do not support trailing comments)
USER node
WORKDIR /home/node/app
COPY --chown=node:node . .
CMD ["node", "server.js"]

Binding to port < 1024 as non-root

Ports below 1024 historically required root on Linux. Modern options:

  • Don't use low ports in containers. Use port 8080 inside; let the orchestrator or a reverse proxy map 80 → 8080. This is the default pattern.
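  • Set net.ipv4.ip_unprivileged_port_start=0 via sysctl — this lets any user bind to any port. The setting is per network namespace (kernel 4.11+), so it can be applied to a single container (docker run --sysctl net.ipv4.ip_unprivileged_port_start=0) instead of the whole host.
  • Grant CAP_NET_BIND_SERVICE to the binary: setcap 'cap_net_bind_service=+ep' /usr/sbin/nginx in the Dockerfile. The binary gets the capability without running as root. This only works while the capability remains in the container's bounding set, so a pod that drops ALL must add NET_BIND_SERVICE back.
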
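FROM nginx:alpine
# setcap is not part of the base image; on Alpine the libcap package
# typically provides it (check the package name for your Alpine release)
RUN apk add --no-cache libcap && \
    setcap 'cap_net_bind_service=+ep' /usr/sbin/nginx && \
    touch /var/run/nginx.pid && \
    chown -R nginx:nginx /var/cache/nginx /var/log/nginx /etc/nginx/conf.d /var/run/nginx.pid
USER nginx
EXPOSE 80
CMD ["nginx", "-g", "daemon off;"]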
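
Whether a process without CAP_NET_BIND_SERVICE may bind a given port comes down to one comparison against net.ipv4.ip_unprivileged_port_start. A minimal sketch of that rule follows. The function name unprivileged_port_ok is illustrative, and the fallback of 1024 assumes the kernel default when the sysctl cannot be read.

```shell
# unprivileged_port_ok PORT [START]: succeed if a process without
# CAP_NET_BIND_SERVICE may bind PORT. START defaults to the value reported
# for the current network namespace, falling back to 1024.
unprivileged_port_ok() {
  port=$1
  start=${2:-$(cat /proc/sys/net/ipv4/ip_unprivileged_port_start 2>/dev/null || echo 1024)}
  [ "$port" -ge "$start" ]
}
```

For example, unprivileged_port_ok 8080 1024 succeeds and unprivileged_port_ok 80 1024 fails, while unprivileged_port_ok 80 0 succeeds once the sysctl has been lowered to 0.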

Kubernetes: runAsNonRoot and securityContext

apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 10001
    runAsGroup: 10001
    fsGroup: 10001
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: myorg/myapp:v1.2.3
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
    volumeMounts:
    - name: tmp
      mountPath: /tmp
  volumes:
  - name: tmp
    emptyDir: {}

What each piece does:

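  • runAsNonRoot: true — kubelet refuses to start the container if the image's user is root. Fails loud rather than silently running as root. If the image declares USER by name instead of a numeric UID, the kubelet cannot verify it and also refuses to start, so use a numeric USER or set runAsUser.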
  • runAsUser: 10001 — explicit UID override.
  • allowPrivilegeEscalation: false — sets no_new_privs; prevents SUID binaries from escalating.
  • readOnlyRootFilesystem: true — container filesystem is read-only; writes go to explicit mounts.
  • capabilities.drop: ["ALL"] — start with zero caps, add back only what you need.
  • seccompProfile — apply the default seccomp filter.

This is the Kubernetes-level equivalent of a well-built Dockerfile. Both layers matter: even with a good Dockerfile, securityContext in the pod spec prevents overrides.
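
A quick pre-commit check for the settings listed above can be sketched in a few lines of shell. This is a plain text match, not a YAML parser: it assumes the manifest writes each setting exactly as in the example pod (including the inline drop: ["ALL"] form), and it cannot tell which container a setting belongs to. The function name missing_hardening is illustrative.

```shell
# missing_hardening FILE: print each hardening setting whose text does not
# appear anywhere in the manifest. Empty output means all were found.
missing_hardening() {
  for setting in 'runAsNonRoot: true' \
                 'allowPrivilegeEscalation: false' \
                 'readOnlyRootFilesystem: true' \
                 'drop: ["ALL"]' \
                 'type: RuntimeDefault'; do
    if ! grep -qF -- "$setting" "$1"; then
      echo "$setting"
    fi
  done
}
```

Run it as missing_hardening pod.yaml. A check like this catches obvious omissions early, but an admission policy remains the authoritative gate.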

PRO TIP

Adopt Pod Security Admission (PSA) or a policy engine (Kyverno, OPA Gatekeeper) to enforce runAsNonRoot: true and allowPrivilegeEscalation: false cluster-wide. Individual teams forget to set these on new manifests; policy enforcement catches every new workload automatically. Of the standard Pod Security Standards, restricted requires both settings; baseline does not, although it does block privileged containers and hostPath volumes.
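
Pod Security Admission is switched on per namespace with labels. The sketch below prints a Namespace manifest carrying the enforce and warn labels. The helper name psa_namespace and the namespace name used in the usage note are hypothetical, and restricted is assumed as the target level.

```shell
# psa_namespace NAME [LEVEL]: print a Namespace manifest that enforces a
# Pod Security Standard level (default: restricted) through PSA labels.
psa_namespace() {
  cat <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: $1
  labels:
    pod-security.kubernetes.io/enforce: ${2:-restricted}
    pod-security.kubernetes.io/warn: ${2:-restricted}
EOF
}
```

The output can be piped to the cluster, for example psa_namespace payments | kubectl apply -f -. On an existing namespace, starting with only the warn label is a gentler way to find workloads that would be rejected.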


Rootless Docker and Podman

Even the Docker daemon traditionally runs as root on the host. Rootless Docker and rootless Podman change this: the container runtime itself runs as a regular user, leveraging user namespaces to make "root in container" map to a host UID like 1000.

Rootless Podman (the cleanest rootless story today)

# As a regular user, no sudo needed
podman run -it --rm alpine sh
# Inside container
id
# uid=0(root) gid=0(root)

# On the host, from another terminal
ps -u $(whoami) | grep sh
# (the shell is running as your host user)

Podman is the preferred rootless runtime for many ops teams — it does not require a daemon, integrates with systemd user units, and does not require root for normal use.

Rootless Docker

# Install rootless Docker for the current user
curl -fsSL https://get.docker.com/rootless | sh
# Starts a user-level dockerd

export DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock
docker run --rm alpine id
# uid=0(root) gid=0(root) — but mapped to your host UID

Rootless Docker has limitations (some networking features, --privileged does not make sense, some storage drivers) but the gains in security are real. For workloads where "container escape could reach host root" is an unacceptable risk, rootless is the answer.

User namespaces on the Docker daemon

In /etc/docker/daemon.json:

{
  "userns-remap": "default"
}

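Restart the daemon after this change. Every container's UID 0 now maps to an unprivileged high host UID (e.g., 100000). Bind-mount ownership behaves differently as a result: files owned by host root are no longer writable from the container, which fixes some problems and breaks some workloads. This is less disruptive than going fully rootless, but also less complete.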
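
The mapping itself is offset arithmetic over a subordinate UID range. The sketch below assumes a single range such as dockremap:100000:65536 in /etc/subuid (typical values, used here only for illustration). The function name remapped_uid is hypothetical.

```shell
# remapped_uid START COUNT CONTAINER_UID: print the host UID that a
# container UID maps to under one subordinate range START:COUNT.
remapped_uid() {
  if [ "$3" -lt 0 ] || [ "$3" -ge "$2" ]; then
    echo "container UID $3 is outside the mapped range" >&2
    return 1
  fi
  echo $(( $1 + $3 ))
}
```

With that range, remapped_uid 100000 65536 0 prints 100000 and remapped_uid 100000 65536 10001 prints 110001, which is the owner that shows up on the host for files the container writes to a bind mount.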


Common Anti-Patterns

USER root at the end of a Dockerfile

USER app
# ... a bunch of setup ...
# "because we need to chown this final thing"
USER root
CMD ["./server"]

If you switch to root for a specific operation (install a system package, fix permissions), switch back to the non-root user before the final CMD. Otherwise the container runs as root at runtime, defeating the whole point.
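
Because only the last USER in the final stage counts, this mistake is easy to catch statically. The sketch below is a rough check rather than a full Dockerfile parser. It assumes every base image defaults to root (not true for images such as the distroless :nonroot variants) and ignores ARG substitution. The function names final_user and runs_as_root are illustrative.

```shell
# final_user DOCKERFILE: print the user the last build stage ends with.
# Each FROM starts a new stage, which is assumed to begin as root.
final_user() {
  awk '
    BEGIN { user = "root" }
    toupper($1) == "FROM" { user = "root" }
    toupper($1) == "USER" { user = $2 }
    END { print user }
  ' "$1"
}

# runs_as_root DOCKERFILE: succeed if the final user is root or UID 0.
runs_as_root() {
  case "$(final_user "$1")" in
    root|root:*|0|0:*) return 0 ;;
    *) return 1 ;;
  esac
}
```

In CI this can be used as: if runs_as_root Dockerfile; then echo "image would run as root" >&2; exit 1; fi.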

Running as root "for dev"

# Dockerfile.dev
FROM node:20
WORKDIR /app
# NO USER directive — runs as root so bind-mounted source isn't permission-locked
CMD ["npm", "run", "dev"]

Instead, accept some permission friction and run the container with a UID that matches your host user:

services:
  app:
    build: .
    user: "${UID:-1000}:${GID:-1000}"
    volumes:
      - ./:/app

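Export UID and GID in your shell config or a .envrc, and the container runs as your host user, writing files with your UID, with no chown dance. In bash, UID is a read-only variable that is not exported by default, so write export UID (without an assignment) and export GID=$(id -g).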
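
An alternative to exporting the variables is to record them in the project's .env file, which Compose reads for variable substitution. A minimal sketch, with the hypothetical helper name write_compose_env. Note that it overwrites any existing .env in that directory.

```shell
# write_compose_env DIR: write the caller's UID and GID to DIR/.env so that
# ${UID} and ${GID} in the Compose file resolve to the host user.
write_compose_env() {
  printf 'UID=%s\nGID=%s\n' "$(id -u)" "$(id -g)" > "$1/.env"
}
```

Running write_compose_env . once per checkout is enough. If the project already has a .env, append the two lines by hand instead.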

--privileged for GPU access, mount access, etc.

# WRONG: blanket privileged
docker run --privileged ...

# RIGHT: grant only what you need
docker run --device=/dev/nvidia0 --device=/dev/nvidiactl ...
docker run --cap-add=SYS_ADMIN --security-opt seccomp=unconfined ...  # last resort

For NVIDIA GPUs, the nvidia-container-runtime handles device access without --privileged. For specific device access, use --device=. --privileged is almost always overkill.

Mounting Docker socket into app containers

volumes:
  - /var/run/docker.sock:/var/run/docker.sock

This gives the container the power of the host's Docker daemon. Any escape or compromise turns into host root access. If you need Docker-in-Docker-ish access, use a sidecar pattern with strict RBAC and limit which containers can see the socket. Never in a web app.

WARNING

The combination of root in container + docker.sock mounted is one of the most common container breakout vectors in real-world pentests. Treat the Docker socket like a root shell — because for anyone with access to it, that is what it is.
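
Socket mounts are easy to find with a text search across Compose files and manifests. The sketch below is deliberately simple: it flags any line mentioning docker.sock, including comments, so treat hits as candidates to review. The function name find_socket_mounts is illustrative.

```shell
# find_socket_mounts FILE...: print file:line:text for every line that
# mentions docker.sock. /dev/null is added so grep always prints filenames.
# Exit status is non-zero when nothing is found.
find_socket_mounts() {
  grep -n 'docker\.sock' "$@" /dev/null
}
```

For example, find_socket_mounts docker-compose.yml k8s/*.yaml lists every candidate mount. The k8s/ path is a placeholder for wherever your manifests live.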


A Checklist for Every Image

Before shipping an image to production, verify:

  • USER is set to a non-root user (UID > 10000 ideally).
  • The user has no interactive shell (--shell /usr/sbin/nologin).
  • The user owns the dirs the app writes to (via COPY --chown or chown in the same RUN as setup).
  • No SUID binaries left in the image unnecessarily. Audit with find / -perm -4000 2>/dev/null.
  • No dev tools or package managers in the final image. Multi-stage builds keep them out.
  • .dockerignore excludes .git, .env, local state, test fixtures.
  • Secrets are not baked into the image (use runtime env or secret mounts).
  • Base image is pinned to a specific tag, ideally a digest.
  • Image is scanned for CVEs (Lesson 5.2).
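
The SUID item in the checklist above can be turned into a small reusable check. This sketch narrows the search to regular files and stays on one filesystem, so /proc and mounted volumes are skipped. The function name list_suid is illustrative.

```shell
# list_suid ROOT: print regular files under ROOT that have the setuid bit
# set, without crossing into other mounted filesystems.
list_suid() {
  find "$1" -xdev -type f -perm -4000 2>/dev/null
}
```

Running list_suid / inside the final image (for instance through docker run --rm --entrypoint sh <image> with the function pasted in) gives the list to review. Anything the application does not need is a candidate for chmod u-s in the Dockerfile.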

At runtime (orchestrator-level):

  • runAsNonRoot: true.
  • allowPrivilegeEscalation: false.
  • capabilities.drop: ["ALL"] and add only what's needed.
  • readOnlyRootFilesystem: true with explicit emptyDir mounts for writable paths.
  • seccompProfile: RuntimeDefault (or stricter).
  • Resource limits set (Lesson 5.3).
  • No privileged: true.
  • No host path mounts except where absolutely necessary.

Key Concepts Summary

  • Root in a container is weaker than root on the host — but still dangerous. Capabilities, bind-mounted host files, kernel CVEs, the Docker socket all amplify the risk.
  • Always USER <name> in your Dockerfile. Use UIDs > 10000 to avoid host-user conflicts.
  • In Kubernetes, enforce runAsNonRoot: true. It fails closed on images that default to root.
  • Drop all capabilities by default. Add back only what the app actually uses.
  • readOnlyRootFilesystem: true + explicit emptyDir mounts closes a huge class of tamper attacks.
  • Rootless Docker / Podman runs the runtime as a regular user; a full container escape still lands on an unprivileged host user.
  • Never mount the Docker socket into app containers. It is root-equivalent.
  • --privileged turns off most of the isolation. Use it only for host-level admin tools with strict access control.
  • Policy enforcement (Pod Security Admission, Kyverno, OPA Gatekeeper) catches misconfigurations on every new workload.

Common Mistakes

  • No USER in the Dockerfile → root at runtime. Check with docker run --rm <image> id.
  • Switching to root for a setup step after setting USER and never switching back to the non-root user.
  • Using low UIDs that collide with host accounts on bind mounts (system users below 1000, the first human users from 1000 up). Prefer UIDs above 10000.
  • Running --privileged "just to make it work." Find the specific cap / device / sysctl you actually need.
  • Mounting /var/run/docker.sock into app containers. Same as giving them host root.
  • Assuming container isolation replaces kernel patching. Shared kernel = shared attack surface.
  • Relying on Docker's default caps as "secure." The defaults are reasonable but not minimal; drop ALL and add back.
  • Skipping runAsNonRoot: true in Kubernetes. A pod manifest without it lets any root-defaulting image run as root.
  • Binding low ports in containers "because nginx." Use 8080 or setcap, not root.
  • Forgetting that useradd without --no-create-home --shell nologin creates an interactive user and a home dir you do not need.

KNOWLEDGE CHECK

Your team runs Kubernetes. A developer's new pod mounts a hostPath volume for `/var/log/myapp` and does not set `securityContext`. Three months later, a security audit flags the pod as a risk. What are the likely findings, and which single policy change would have prevented this entire class of issue at deploy time?