Docker & Container Fundamentals

Container Won't Start

Staging is green. An engineer tags the latest build as v1.2.3, pushes to production, and runs kubectl apply on the updated Deployment. Within 30 seconds, every pod is in CrashLoopBackOff. The logs say nothing useful — the main process died before it could log anything. The lead's first instinct is "roll back," which is the right reflex. But the follow-up question is "why did this happen?" And now they have three minutes, before the rollback finishes, to grab evidence: exit code, last stderr, what /proc says about the process. Those three minutes determine whether this is a "fix in the morning" problem or a "3 AM war room" problem.

"Container won't start" collapses into a handful of root causes — wrong command, missing dep, bad config, wrong permissions, missing env var, port already in use. This lesson is the debugging flowchart: what to check in what order, how to interpret exit codes, and how to get a shell into a container that refuses to stay running. When you have done this three or four times the pattern is automatic.


The Order of Operations

Container won't start.

1.  docker ps -a | grep <name>
    └── Is the container there? What state? What exit code?

2.  docker logs <name>
    └── What did the process say before it died?

3.  docker inspect <name>
    └── What config did it actually run with?
        - Command, args, env, volumes, network, user, restart policy

4.  docker events &
    └── Watch the event stream for recurring errors.

5.  Override the entrypoint:  docker run --entrypoint sh ... image:tag
    └── Get a shell; run the real command by hand, see what breaks.

6.  On the host:  ps aux | grep <expected-cmd>
    └── Is the process actually starting? Is it being killed by the kernel?

7.  dmesg | tail
    └── OOM? Segfault? Permission denied? Kernel-level clues.

8.  If nothing above yields clues: rebuild with a debugging image (full shell,
    build tools), run the entrypoint manually step by step.

This is the flowchart. The rest of the lesson drills into each step.


Step 1: State and Exit Code

docker ps -a | grep myapp
# abc123  myorg/myapp:v1.2.3  "/docker-entrypoint…"  10 minutes ago  Exited (1) 10 minutes ago   myapp

Key information in the docker ps -a line:

  • STATUS — current state. Relevant values:
    • Exited (<code>) — the container terminated; look at the exit code.
    • Restarting — auto-restarting (--restart set); it will keep crashing unless fixed.
    • Up — currently running (with --restart set, a crash-looping container can briefly show Up if you catch it between failures).
    • Dead — failed in a way that removed it from the active set; inspect for details.
  • Exit code — the important number.
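When all you have is the docker ps line (pasted into a ticket, say), the code can be pulled out of the STATUS string with sed; the sample string below stands in for real output, and on a live host docker inspect --format '{{.State.ExitCode}}' is the canonical query.

```shell
# Extract the numeric exit code from a docker ps STATUS string.
# Sample input; on the host itself prefer:
#   docker inspect --format '{{.State.ExitCode}}' myapp
status='Exited (137) 10 minutes ago'
code=$(echo "$status" | sed -n 's/^Exited (\([0-9]*\)).*/\1/p')
echo "$code"    # 137
```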

Exit code cheat sheet

Exit code   What it means
0           Clean exit
1           Generic error; the application returned 1
2           Shell misuse or missing file (classic "file not found")
125         Docker daemon error (something wrong with the command you gave Docker)
126         Container command exists but is not executable
127         Container command not found
128 + N     Killed by signal N (e.g., 130 = 128 + 2 = SIGINT, 137 = SIGKILL, 139 = SIGSEGV, 143 = SIGTERM)

Special cases:

  • 137 + OOMKilled: true — the container exceeded its memory limit; the kernel killed it. Fix the limit or the memory leak.
  • 137 with OOMKilled: false — external SIGKILL, usually from a docker kill or a kubectl delete --grace-period=0.
  • 143 — clean SIGTERM; the container handled shutdown. Normal on docker stop.
  • 126 / 127 — typically a bad Dockerfile: CMD ["./missing-binary"] or missing exec bit on a script.

# Get the OOMKilled flag
docker inspect myapp --format='{{.State.OOMKilled}}'
# true    → memory limit hit
# false   → external signal

PRO TIP

Exit code is the fastest "what category of problem is this" signal. Write it on your mental cheat sheet: 137 = SIGKILL (OOM or forced), 139 = segfault, 143 = SIGTERM (clean), 1 = app error, 127 = command not found, 125 = bad docker args. From one number you narrow the debug path from "what is happening" to "this specific class of thing happened."
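The 128 + N arithmetic can be automated. This small helper (a hypothetical name, not a Docker feature) asks the shell's own kill -l for the signal name:

```shell
# Hypothetical helper: turn a container exit code into a readable cause.
decode_exit() {
  code=$1
  if [ "$code" -ge 129 ] && [ "$code" -le 192 ]; then
    # kill -l N prints the name of signal N; normalize case across shells
    sig=$(kill -l $((code - 128)) | tr '[:lower:]' '[:upper:]')
    echo "killed by SIG$sig"
  else
    echo "application exit code $code"
  fi
}

decode_exit 137   # killed by SIGKILL
decode_exit 139   # killed by SIGSEGV
decode_exit 1     # application exit code 1
```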


Step 2: Logs

docker logs myapp

# With timestamps
docker logs --timestamps myapp

# Only the last N lines
docker logs --tail 50 myapp

# Follow (live)
docker logs -f myapp

# Only since a time
docker logs --since "5 minutes ago" myapp

The usual pattern: the last few lines of the log tell you what killed the process. If logs are empty, the process died before it could produce any output (bad entrypoint, missing binary, permission denied on exec).

"My logs are empty"

docker logs myapp
# (nothing)

Possible causes:

  1. The process died before logging anything. Check exit code; run the entrypoint manually (step 5).
  2. The app writes to a file, not stdout. Look inside the container (docker exec if it is running, or docker run --rm --entrypoint sh to poke the image).
  3. The log driver is set to something other than json-file. docker inspect --format='{{.HostConfig.LogConfig}}' myapp. If it is syslog, journald, gelf, awslogs, fluentd, etc., the logs are in that system, not in Docker.
  4. Kubernetes: kubectl logs -p to see the previous container's logs after a crash loop.

"Docker is producing GB of logs"

docker logs myapp --tail 10
# Last 10 lines only — avoid pulling hundreds of megabytes over ssh

# Check log size on the host
ls -lh /var/lib/docker/containers/<container-id>/*.log

# Truncate (while container keeps running)
sudo sh -c ': > /var/lib/docker/containers/<container-id>/<container-id>-json.log'

# Long-term fix: log rotation in /etc/docker/daemon.json
# {
#   "log-driver": "json-file",
#   "log-opts": { "max-size": "10m", "max-file": "5" }
# }
# restart the docker daemon; new containers pick this up

Step 3: Inspect — What Config Is Actually Applied

docker inspect myapp
# Full JSON dump

# Key fields:
docker inspect myapp --format='{{.State.Error}}'
# (empty or a string like "OCI runtime exec failed: exec: \"./missing.sh\": stat ./missing.sh: no such file or directory: unknown")

docker inspect myapp --format='{{.Config.Cmd}}'
# [./run.sh]

docker inspect myapp --format='{{.Config.Entrypoint}}'
# [/docker-entrypoint.sh]

docker inspect myapp --format='{{json .Config.Env}}' | jq
# ["PATH=/usr/local/sbin:...", "NODE_ENV=production", "DATABASE_URL=...", ...]

docker inspect myapp --format='{{json .Mounts}}' | jq
# [{"Type":"bind","Source":"/etc/myapp.conf","Destination":"/etc/app/config.yaml",...}]

docker inspect myapp --format='{{json .HostConfig.RestartPolicy}}'
# {"Name":"unless-stopped","MaximumRetryCount":0}

Common "aha" moments in inspect output:

  • Environment variable missing or mistyped. Your .env had DATABASEURL instead of DATABASE_URL.
  • Volume mount points to a path that exists on the host but not in the image. Container process can't find the file it expects.
  • User is root but image expected non-root. Permission errors on the application's data path.
  • Restart policy is no. Container is not auto-restarting and you have been looking at its stopped state for 20 minutes.
  • Command is "sleep infinity". Someone left a debug command in the manifest.

Kubernetes equivalent

kubectl describe pod mypod

# Look for:
# State: Waiting / Running / Terminated
#   Reason: CrashLoopBackOff / ImagePullBackOff / OOMKilled
#   Last State: (previous container's state — what killed it last time)
# Events: at the bottom — chronological list of what the kubelet tried and what happened

The Last State + Reason combination is the single most useful piece of info when a pod is in CrashLoopBackOff.


Step 4: Events Stream

docker events
# Live stream of lifecycle events for ALL containers, networks, volumes
# 2026-04-20T10:00:00.123Z container create abc123 ...
# 2026-04-20T10:00:00.456Z container start  abc123 ...
# 2026-04-20T10:00:02.789Z container die    abc123 (exitCode=137)
# 2026-04-20T10:00:03.012Z container destroy abc123 ...
# (then the auto-restart loop starts over)

# Filter
docker events --filter container=myapp

# Recent past
docker events --since "5 minutes ago"

# Only specific events
docker events --filter event=die --filter event=start

Running docker events in a second terminal while you debug is a great way to see exactly when the container dies and what immediately preceded it. The output includes exit codes on die events.


Step 5: Override the Entrypoint (The Killer Move)

When the container crashes so fast there are no logs, the trick is to run the image with something that does not crash — typically sh — and then try the entrypoint manually.

# Instead of the image's normal entrypoint
docker run --rm -it --entrypoint sh myorg/myapp:v1.2.3

# Inside the container
ls /                            # is the rootfs what I expected?
cat /etc/app/config.yaml         # is the config there? is it readable?
id                               # who am I running as?
ls -l /app                       # can my user read the app files?
./run.sh                         # try running the entrypoint manually
# Ah — "Error: cannot open /var/lib/app/data: Permission denied"

This almost always finds the problem. The container that crashed now has you inside it, with the same filesystem and env, and you can step through the entrypoint until something fails.

If the image has no shell

Distroless images (gcr.io/distroless/...) do not include a shell. You cannot --entrypoint sh. Options:

  • Use the :debug variant of the distroless image — it includes busybox.
    docker run --rm -it --entrypoint sh gcr.io/distroless/static-debian12:debug
  • Use nsenter from the host to drop into the namespaces of a (briefly) running container. Race condition, but sometimes works.
  • Copy the image's rootfs out and inspect from the host:
    CONTAINER=$(docker create myorg/myapp:v1.2.3)   # creates but does not start
    docker export $CONTAINER | tar -t | head         # peek at filesystem
    docker cp $CONTAINER:/app /tmp/app-image-files   # pull out files
    docker rm $CONTAINER

PRO TIP

Keep a debug Dockerfile variant of your production image that FROMs the prod image and adds a shell + debugging tools (bash, strace, curl, netstat). When a prod container will not start, swap the image tag to the debug variant to get a shell, find the root cause, fix, and go back to the slim production image.
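A sketch of such a variant, assuming the production image is Debian-based (swap apt-get for apk on Alpine); image tag and tool list are illustrative:

```dockerfile
# Debug variant: FROM the exact prod tag, add tools, change nothing else.
FROM myorg/myapp:v1.2.3
USER root
RUN apt-get update && apt-get install -y --no-install-recommends \
      bash curl strace procps net-tools \
    && rm -rf /var/lib/apt/lists/*
# Entrypoint left unchanged: same startup path as production, but now you
# can docker exec in with a real shell and strace the process.
```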


Step 6 and 7: Host-Side Forensics

If the container is in a Restarting state with --restart=always or unless-stopped, you can sometimes catch it mid-run:

# Watch for the container, capture its PID in the brief window it lives
while true; do
  PID=$(docker inspect --format='{{.State.Pid}}' myapp 2>/dev/null)
  if [ "$PID" != "0" ] && [ -n "$PID" ]; then
    echo "PID: $PID"
    cat /proc/$PID/cmdline | tr '\0' ' '
    cat /proc/$PID/status | grep -E '^(State|VmRSS|Uid|Gid)'
    break
  fi
  sleep 0.1
done

And check dmesg for kernel-level clues:

sudo dmesg -T | tail -50 | grep -E 'oom|segfault|killed process|i/o error|veth|docker'
# [Sat Apr 20 10:00:02 2026] Memory cgroup out of memory: Killed process 12345 (node) ...
# [Sat Apr 20 10:00:02 2026] node[12345]: segfault at 0 ip 00007f... rip 00007f... error 6 in libc.so.6

This reveals:

  • OOM kills (including cgroup OOM — essential if OOMKilled: true in inspect).
  • Segfaults (with fault addresses).
  • Kernel-level permission denials.
  • Device errors.

Common Patterns

1. executable file not found in $PATH (exit 127)

docker: Error response from daemon: ... exec: "./run.sh": stat ./run.sh: no such file or directory

Causes:

  • The binary is at a different path in the image than you thought.
  • You are using shell form CMD run.sh — it tries to execute run.sh via PATH, but your CWD may not be in PATH.
  • A multi-stage build copied binaries to an unexpected location.

Fix: check paths with --entrypoint sh + which, and use absolute paths in exec-form CMD/ENTRYPOINT.
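A minimal sketch of the robust pattern — exec form plus an absolute path, with the exec bit set inside the image (paths here are illustrative):

```dockerfile
# Illustrative paths; exec form + absolute path avoids PATH lookup entirely.
WORKDIR /app
COPY run.sh /app/run.sh
RUN chmod +x /app/run.sh       # don't rely on the exec bit surviving git/CI
ENTRYPOINT ["/app/run.sh"]
```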

2. permission denied (exit 126 or 1)

/docker-entrypoint.sh: line 1: exec: ./run: Permission denied

Causes:

  • Script missing exec bit. chmod +x in the Dockerfile or in git.
  • Mounted volume with wrong ownership — container's user cannot access the file.
  • SELinux / AppArmor blocking the exec.

Fix: ls -l /path/to/entry inside the container (via --entrypoint sh), check ownership and mode.

3. App crashes immediately after printing one line (exit 1)

Error: DATABASE_URL is not set

Causes:

  • Env var missing from -e / env_file / Kubernetes env.
  • .env file not loaded (Compose only auto-loads .env from the compose file's dir).
  • Typo in variable name.

Fix: docker inspect --format='{{.Config.Env}}' myapp to see actual env.
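A hedged Compose sketch of making the requirement explicit — service and variable names follow the lesson's examples; the ${VAR:?message} expansion makes Compose fail fast at startup instead of letting the container crash:

```yaml
# Sketch only: service/image names follow the examples above.
services:
  myapp:
    image: myorg/myapp:v1.2.3
    env_file: .env               # auto-loaded only from the compose file's dir
    environment:
      # Fail at `docker compose up` time if the variable is unset,
      # instead of crashing inside the container.
      DATABASE_URL: ${DATABASE_URL:?DATABASE_URL is not set}
```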

4. "Bind port already in use"

docker: Error response from daemon: ... bind for 0.0.0.0:8080 failed: port is already allocated

Causes:

  • Another container already publishing that port.
  • Host service already listening.

Fix: sudo ss -tlnp | grep :8080 to see what's on it. Choose a different host port or stop the other process.

5. Healthcheck failing silently

docker inspect --format='{{.State.Health.Status}}' myapp
# unhealthy

docker inspect --format='{{json .State.Health.Log}}' myapp | jq '.[-1]'
# {"Start":"...","End":"...","ExitCode":1,"Output":"curl: (7) Failed to connect to localhost port 8080: Connection refused"}

Causes:

  • App binds an interface the check can't reach. The healthcheck runs inside the container's own network namespace, so 127.0.0.1 is usually fine — but an IPv6-only bind, or a bind to one specific interface, can still miss.
  • Port mismatch (app listens 3000, healthcheck hits 8080).
  • App is slow to start; start_period too short.

Fix: run the healthcheck command manually inside the container. Adjust start_period.
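A sketch of the Dockerfile side, assuming the app serves HTTP on 8080 and curl exists in the image — the interval and start-period values are placeholders to tune:

```dockerfile
# Placeholder values; give a slow-starting app enough runway via --start-period.
HEALTHCHECK --interval=10s --timeout=3s --start-period=60s --retries=3 \
  CMD curl -fsS http://localhost:8080/health || exit 1
```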

6. Immediate OOMKilled at startup

docker inspect myapp --format='{{.State.OOMKilled}}'
# true

Causes:

  • Memory limit too low for the runtime's minimum (e.g., 128 MB for a JVM).
  • Runtime's heap allocation exceeds container's limit.
  • Large allocation early in startup (big config load, warm-up routine).

Fix: raise limit, tune runtime (e.g., JAVA_OPTS='-Xmx256m' + container limit 512 MB).
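docker inspect reports the limit in raw bytes; a quick conversion makes the number legible (the value below is the 128 MiB example used elsewhere in this lesson):

```shell
# Convert the byte count from {{.HostConfig.Memory}} into MiB.
bytes=134217728     # example value; 0 means "no limit set"
echo "$(( bytes / 1024 / 1024 )) MiB"    # 128 MiB
```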

7. Container starts fine locally but crashes in production

Different environments, different behaviors. Common culprits:

  • Missing env vars in the production config.
  • Permissions on bind-mounted paths different in production (host users differ).
  • Image architecture mismatch — you built for amd64, deployed on arm64 (Graviton, Apple Silicon nodes). Symptom: exit 1 with "exec format error" in logs.
  • Registry pull failure — wrong credentials, rate limit, network policy blocking.
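To check for an architecture mismatch quickly, compare the node's kernel architecture with the image's platform. This helper (a hypothetical name) maps uname -m output to Docker's platform strings; the image side comes from docker image inspect --format '{{.Os}}/{{.Architecture}}'.

```shell
# Hypothetical helper: map `uname -m` output to a Docker platform string.
# Compare against: docker image inspect --format '{{.Os}}/{{.Architecture}}' IMAGE
to_platform() {
  case "$1" in
    x86_64)         echo linux/amd64 ;;
    aarch64|arm64)  echo linux/arm64 ;;
    *)              echo "linux/$1" ;;   # pass anything else through
  esac
}

to_platform "$(uname -m)"
```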

A Full Debug Session in Practice

# 1. State
docker ps -a | grep myapp
# Exited (137) 2 minutes ago — running in a restart loop

# 2. Exit code + OOM
docker inspect myapp --format='{{.State.ExitCode}} {{.State.OOMKilled}}'
# 137 true   → cgroup OOM kill

# 3. Logs (since the last restart)
docker logs --tail 50 myapp
# ... Starting nginx ...
# (nothing else — it got killed mid-startup)

# 4. dmesg for confirmation
sudo dmesg -T | tail -20 | grep -i oom
# Memory cgroup out of memory: Killed process 12345 (nginx) total-vm:500M anon-rss:450M ...

# 5. Check the limit
docker inspect myapp --format='{{.HostConfig.Memory}}'
# 134217728    → 128 MiB (too low for what nginx and its workers want)

# 6. Fix: raise the limit
docker update --memory=512m myapp

# 7. Restart cleanly
docker restart myapp

# 8. Verify
docker logs -f myapp
# ... running normally ...

# 9. Make it permanent (in compose / k8s)

10 minutes, root cause identified, fix applied, permanent remediation in flight.


Kubernetes-Specific Flow

# Step 1: pod state
kubectl get pod mypod -o wide
# NAME    READY   STATUS             RESTARTS   AGE
# mypod   0/1     CrashLoopBackOff   5          2m

# Step 2: describe
kubectl describe pod mypod | tail -40
# ... Events: OOMKilled, Last State: Terminated(137), ...

# Step 3: logs (and previous container's logs if crashing)
kubectl logs mypod
kubectl logs mypod -p        # --previous

# Step 4: exec if the container is up long enough
kubectl exec -it mypod -- sh

# Step 5: debug container (K8s 1.23+) — run a sidecar with the same namespace
kubectl debug mypod -it --image=busybox --target=app

# Step 6: ephemeral container with full shell
kubectl debug -it mypod --image=nicolaka/netshoot --target=app

kubectl debug is the killer move in Kubernetes: inject a new container into a running pod that shares the target container's network and process namespaces, giving you a full shell without modifying the original image.


Key Concepts Summary

  • Exit code is the first signal. 137 OOM / 139 segfault / 143 clean SIGTERM / 1 app error / 127 not-found.
  • docker logs + docker inspect cover 80% of container-won't-start problems.
  • --entrypoint sh is the escape hatch when the image crashes too fast to log.
  • docker events shows the live lifecycle stream; great for catching restart loops.
  • dmesg reveals kernel-level kills (OOM, segfault, permission).
  • On Kubernetes, kubectl describe pod + kubectl logs -p are the equivalent first two tools.
  • kubectl debug lets you drop a debugging container into a pod without changing its image.
  • Distroless debugging: use the :debug tag or pull files out with docker cp.
  • Host-side debugging via /proc/<pid>/status captures state when the container is too short-lived to exec into.
  • Always ship a debug variant of your production image for fast diagnosis in prod.

Common Mistakes

  • Looking at docker logs first for an empty container. Exit code comes first — empty logs + exit 127 = command not found, not a logging issue.
  • Restarting the container and hoping it works. Without fixing the root cause, you'll be back in five minutes.
  • Forgetting kubectl logs -p after a crash loop. Current container has no logs yet; the dead one does.
  • Ignoring OOMKilled: true and blaming the app. The kernel killed it because the limit was too low. Fix limits or runtime.
  • Not using docker events when restarts are happening. It pinpoints exactly when each die/start occurs.
  • Overriding --entrypoint without also overriding the image's CMD — its arguments are still passed, and end up as positional args to sh.
  • Using docker exec on a container that is not running — it fails with an unhelpful error. docker ps first to confirm Up state.
  • Running debug commands directly on a production container without a safety net. For high-risk actions, docker commit first so you have an image snapshot to rebase from.
  • Relying on logs only. docker inspect often shows a misconfiguration the logs can never reveal (wrong mount, wrong env, wrong user).
  • Forgetting platform mismatch. "Works on my M1 Mac, crashes in prod" often = amd64-only image.

KNOWLEDGE CHECK

A Kubernetes pod is in `CrashLoopBackOff`. `kubectl logs mypod` returns empty. `kubectl describe pod mypod` shows `Last State: Terminated, Reason: Error, Exit Code: 127`. What is the most likely cause, and what is your next step?