Container Won't Start
Staging is green. An engineer tags the latest build as `v1.2.3`, pushes to production, and runs `kubectl apply` with the updated Deployment. Within 30 seconds, every pod is in `CrashLoopBackOff`. The logs say nothing useful — the main process died before it could log anything. The lead's first instinct is "roll back," which is the right reflex. But the follow-up question is "why did this happen?" And now they have three minutes before the rollback finishes to grab evidence: exit code, last stderr, what `/proc` says about the process. Those three minutes determine whether this is a "fix in the morning" problem or a "3 AM war room" problem.

"Container won't start" collapses into a handful of root causes: wrong command, missing dependency, bad config, wrong permissions, missing env var, port already in use. This lesson is the debugging flowchart: what to check in what order, how to interpret exit codes, and how to get a shell into a container that refuses to stay running. After you have done this three or four times, the pattern is automatic.
The Order of Operations
Container won't start.
1. docker ps -a | grep <name>
└── Is the container there? What state? What exit code?
2. docker logs <name>
└── What did the process say before it died?
3. docker inspect <name>
└── What config did it actually run with?
- Command, args, env, volumes, network, user, restart policy
4. docker events &
└── Watch the event stream for recurring errors.
5. Override the entrypoint: docker run --entrypoint sh ... image:tag
└── Get a shell; run the real command by hand, see what breaks.
6. On the host: ps aux | grep <expected-cmd>
└── Is the process actually starting? Is it being killed by the kernel?
7. dmesg | tail
└── OOM? Segfault? Permission denied? Kernel-level clues.
8. If nothing above yields clues: rebuild with a debugging image (full shell,
build tools), run the entrypoint manually step by step.
This is the flowchart. The rest of the lesson drills into each step.
Step 1: State and Exit Code
docker ps -a | grep myapp
# abc123 myorg/myapp:v1.2.3 "/docker-entrypoint…" 10 minutes ago Exited (1) 10 minutes ago myapp
Key information in the docker ps -a line:
- STATUS — current state. Relevant values:
  - `Exited (<code>)` — the container terminated; look at the exit code.
  - `Restarting` — auto-restarting (`--restart` set); it will keep crashing until fixed.
  - `Up` — currently running (possibly crashing continuously if `--restart` is set and the healthcheck hasn't caught up).
  - `Dead` — failed in a way that removed it from the active set; inspect for details.
- Exit code — the important number.
Exit code cheat sheet
| Exit code | What it means |
|---|---|
| 0 | Clean exit |
| 1 | Generic error; the application returned 1 |
| 2 | Misuse of shell builtins; many tools also use 2 for usage errors or a missing file |
| 125 | Docker daemon error (something wrong with the command you gave Docker) |
| 126 | Container command exists but is not executable |
| 127 | Container command not found |
| 128 + N | Killed by signal N (e.g., 130 = 128 + 2 = SIGINT, 137 = SIGKILL, 139 = SIGSEGV, 143 = SIGTERM) |
Special cases:
- 137 with `OOMKilled: true` — the container exceeded its memory limit; the kernel killed it. Fix the limit or the memory leak.
- 137 with `OOMKilled: false` — external SIGKILL, usually from a `docker kill` or a `kubectl delete --grace-period=0`.
- 143 — clean SIGTERM; the container handled shutdown. Normal on `docker stop`.
- 126 / 127 — typically a bad Dockerfile: `CMD ["./missing-binary"]` or a missing exec bit on a script.
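The 128 + N rule is ordinary POSIX behavior, not Docker magic; any shell on your workstation reproduces the same codes:

```shell
# The same exit-code convention applies to any process, containerized or not.
sh -c 'exit 1'                ; echo "app error:         $?"   # 1
sh -c 'nosuchcmd' 2>/dev/null ; echo "command not found: $?"   # 127
sh -c 'kill -TERM $$'         ; echo "SIGTERM (128+15):  $?"   # 143
sh -c 'kill -KILL $$'         ; echo "SIGKILL (128+9):   $?"   # 137
```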
# Get the OOMKilled flag
docker inspect myapp --format='{{.State.OOMKilled}}'
# true → memory limit hit
# false → external signal
Exit code is the fastest "what category of problem is this" signal. Write it on your mental cheat sheet: 137 = SIGKILL (OOM or forced), 139 = segfault, 143 = SIGTERM (clean), 1 = app error, 127 = command not found, 125 = bad docker args. From one number you narrow the debug path from "what is happening" to "this specific class of thing happened."
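That mental cheat sheet is small enough to keep as a helper function. A sketch — the function name and the message wording are my own, not a standard tool:

```shell
# exit-triage: first-guess category for a container exit code (hypothetical helper)
triage() {
  case "$1" in
    0)   echo "clean exit" ;;
    125) echo "docker daemon error: check the docker run arguments" ;;
    126) echo "found but not executable: check the exec bit" ;;
    127) echo "command not found: check CMD/ENTRYPOINT paths" ;;
    137) echo "SIGKILL: check .State.OOMKilled" ;;
    139) echo "SIGSEGV: segfault, check dmesg" ;;
    143) echo "SIGTERM: normal shutdown" ;;
    *)   if [ "$1" -gt 128 ]; then
           echo "killed by signal $(($1 - 128))"
         else
           echo "application error ($1)"
         fi ;;
  esac
}

triage 137   # SIGKILL: check .State.OOMKilled
triage 130   # killed by signal 2
# Real usage: triage "$(docker inspect myapp --format='{{.State.ExitCode}}')"
```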
Step 2: Logs
docker logs myapp
# With timestamps
docker logs --timestamps myapp
# Only the last N lines
docker logs --tail 50 myapp
# Follow (live)
docker logs -f myapp
# Only since a time
docker logs --since "5 minutes ago" myapp
The usual pattern: the last few lines of the log tell you what killed the process. If logs are empty, the process died before it could produce any output (bad entrypoint, missing binary, permission denied on exec).
"My logs are empty"
docker logs myapp
# (nothing)
Possible causes:
- The process died before logging anything. Check exit code; run the entrypoint manually (step 5).
- The app writes to a file, not stdout. Look inside the container (`docker exec` if it is running, or `docker run --rm --entrypoint sh` to poke at the image).
- The log driver is set to something other than `json-file`. Check with `docker inspect --format='{{.HostConfig.LogConfig}}' myapp`. If it is `syslog`, `journald`, `gelf`, `awslogs`, `fluentd`, etc., the logs live in that system, not in Docker.
- Kubernetes: use `kubectl logs -p` to see the previous container's logs after a crash loop.
"Docker is producing GB of logs"
docker logs myapp --tail 10
# Last 10 lines only — avoid pulling hundreds of megabytes over ssh
# Check log size on the host
ls -lh /var/lib/docker/containers/<container-id>/*.log
# Truncate (while container keeps running)
sudo sh -c ': > /var/lib/docker/containers/<container-id>/<container-id>-json.log'
# Long-term fix: log rotation in /etc/docker/daemon.json
# {
# "log-driver": "json-file",
# "log-opts": { "max-size": "10m", "max-file": "5" }
# }
# restart the docker daemon; new containers pick this up
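The `: >` idiom is deliberate: the daemon keeps the log file's descriptor open, so `rm` would only unlink the name while the space stays allocated until the daemon exits. Truncating in place frees it immediately. A local demonstration:

```shell
# Simulate the daemon's open log file with a temp file
log=$(mktemp)
echo "pretend this is gigabytes of old output" >> "$log"

: > "$log"                       # truncate in place; any open fd stays valid
wc -c < "$log"                   # 0 bytes, space reclaimed immediately
rm "$log"
```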
Step 3: Inspect — What Config Is Actually Applied
docker inspect myapp
# Full JSON dump
# Key fields:
docker inspect myapp --format='{{.State.Error}}'
# (empty or a string like "OCI runtime exec failed: exec: \"./missing.sh\": stat ./missing.sh: no such file or directory: unknown")
docker inspect myapp --format='{{.Config.Cmd}}'
# [./run.sh]
docker inspect myapp --format='{{.Config.Entrypoint}}'
# [/docker-entrypoint.sh]
docker inspect myapp --format='{{json .Config.Env}}' | jq
# ["PATH=/usr/local/sbin:...", "NODE_ENV=production", "DATABASE_URL=...", ...]
docker inspect myapp --format='{{json .Mounts}}' | jq
# [{"Type":"bind","Source":"/etc/myapp.conf","Destination":"/etc/app/config.yaml",...}]
docker inspect myapp --format='{{json .HostConfig.RestartPolicy}}'
# {"Name":"unless-stopped","MaximumRetryCount":0}
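Once you know which fields matter, one expression can pull them all at once. Here the extraction is shown with `jq` against a canned fixture (values invented) so it can be tried without a running daemon; in practice you would feed it the output of `docker inspect myapp`:

```shell
# Fixture standing in for `docker inspect myapp` output (values invented)
cat > /tmp/inspect.json <<'EOF'
[{"State":{"ExitCode":137,"OOMKilled":true},
  "Config":{"User":"","Env":["NODE_ENV=production"]},
  "HostConfig":{"RestartPolicy":{"Name":"unless-stopped"}}}]
EOF

jq -r '.[0] | "exit=\(.State.ExitCode) oom=\(.State.OOMKilled) restart=\(.HostConfig.RestartPolicy.Name)"' /tmp/inspect.json
# exit=137 oom=true restart=unless-stopped
```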
Common "aha" moments in inspect output:
- Environment variable missing or mistyped. Your `.env` had `DATABASEURL` instead of `DATABASE_URL`.
- Volume mount points to a path that exists on the host but not in the image. The container process can't find the file it expects.
- User is root but the image expected non-root. Permission errors on the application's data path.
- Restart policy is `no`. The container is not auto-restarting and you have been looking at its stopped state for 20 minutes.
- Command is `sleep infinity`. Someone left a debug command in the manifest.
Kubernetes equivalent
kubectl describe pod mypod
# Look for:
# State: Waiting / Running / Terminated
# Reason: CrashLoopBackOff / ImagePullBackOff / OOMKilled
# Last State: (previous container's state — what killed it last time)
# Events: at the bottom — chronological list of what the kubelet tried and what happened
The Last State + Reason combination is the single most useful piece of info when a pod is in CrashLoopBackOff.
Step 4: Events Stream
docker events
# Live stream of lifecycle events for ALL containers, networks, volumes
# 2026-04-20T10:00:00.123Z container create abc123 ...
# 2026-04-20T10:00:00.456Z container start abc123 ...
# 2026-04-20T10:00:02.789Z container die abc123 (exitCode=137)
# 2026-04-20T10:00:03.012Z container destroy abc123 ...
# (then the auto-restart loop starts over)
# Filter
docker events --filter container=myapp
# Recent past
docker events --since "5 minutes ago"
# Only specific events
docker events --filter event=die --filter event=start
Running `docker events` in a second terminal while you debug is a great way to see exactly when the container dies and what immediately preceded it. The output includes exit codes on `die` events.
Step 5: Override the Entrypoint (The Killer Move)
When the container crashes so fast there are no logs, the trick is to run the image with something that does not crash — typically sh — and then try the entrypoint manually.
# Instead of the image's normal entrypoint
docker run --rm -it --entrypoint sh myorg/myapp:v1.2.3
# Inside the container
ls / # is the rootfs what I expected?
cat /etc/app/config.yaml # is the config there? is it readable?
id # who am I running as?
ls -l /app # can my user read the app files?
./run.sh # try running the entrypoint manually
# Ah — "Error: cannot open /var/lib/app/data: Permission denied"
This almost always finds the problem. The container that crashed now has you inside it, with the same filesystem and env, and you can step through the entrypoint until something fails.
If the image has no shell
Distroless images (`gcr.io/distroless/...`) do not include a shell, so you cannot `--entrypoint sh`. Options:
- Use the `:debug` variant of the distroless image — it includes busybox: `docker run --rm -it --entrypoint sh gcr.io/distroless/static-debian12:debug`
- Use `nsenter` from the host to drop into the namespaces of a (briefly) running container. It's a race, but it sometimes works.
- Copy the image's rootfs out and inspect it from the host:

CONTAINER=$(docker create myorg/myapp:v1.2.3)    # creates but does not start
docker export $CONTAINER | tar -t | head         # peek at the filesystem
docker cp $CONTAINER:/app /tmp/app-image-files   # pull out files
docker rm $CONTAINER
Keep a debug Dockerfile variant of your production image that FROMs the prod image and adds a shell + debugging tools (bash, strace, curl, netstat). When a prod container will not start, swap the image tag to the debug variant to get a shell, find the root cause, fix, and go back to the slim production image.
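A sketch of what that debug variant can look like, assuming a Debian-based production image (on Alpine the install line would use `apk add`); the image tags are the ones from this lesson's example:

```shell
# Write a debug-variant Dockerfile next to the prod one
cat > Dockerfile.debug <<'EOF'
FROM myorg/myapp:v1.2.3
USER root
RUN apt-get update && \
    apt-get install -y --no-install-recommends bash curl procps strace net-tools && \
    rm -rf /var/lib/apt/lists/*
EOF

# Build and use it only when something is broken:
#   docker build -f Dockerfile.debug -t myorg/myapp:v1.2.3-debug .
#   docker run --rm -it --entrypoint bash myorg/myapp:v1.2.3-debug
```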
Step 6 and 7: Host-Side Forensics
If the container is in a Restarting state with --restart=always or unless-stopped, you can sometimes catch it mid-run:
# Watch for the container, capture its PID in the brief window it lives
while true; do
PID=$(docker inspect --format='{{.State.Pid}}' myapp 2>/dev/null)
if [ "$PID" != "0" ] && [ -n "$PID" ]; then
echo "PID: $PID"
cat /proc/$PID/cmdline | tr '\0' ' '
cat /proc/$PID/status | grep -E '^(State|VmRSS|Uid|Gid)'
break
fi
sleep 0.1
done
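Those `/proc` reads work on any live Linux process; to see the shape of the output, point them at the current shell (`$$`) first:

```shell
# cmdline is NUL-separated; status is "Key: value" lines
tr '\0' ' ' < /proc/$$/cmdline ; echo
grep -E '^(State|VmRSS|Uid):' /proc/$$/status
# e.g.  State:  S (sleeping)
#       VmRSS:  3456 kB
#       Uid:    1000 1000 1000 1000
```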
And check dmesg for kernel-level clues:
sudo dmesg -T | tail -50 | grep -E 'oom|segfault|killed process|i/o error|veth|docker'
# [Sat Apr 20 10:00:02 2026] Memory cgroup out of memory: Killed process 12345 (node) ...
# [Sat Apr 20 10:00:02 2026] node[12345]: segfault at 0 ip 00007f... rip 00007f... error 6 in libc.so.6
This reveals:
- OOM kills (including cgroup OOM — essential if `OOMKilled: true` in inspect).
- Segfaults (with fault addresses).
- Kernel-level permission denials.
- Device errors.
Common Patterns
1. executable file not found in $PATH (exit 127)
docker: Error response from daemon: ... exec: "./run.sh": stat ./run.sh: no such file or directory
Causes:
- The binary is at a different path in the image than you thought.
- You are using shell form `CMD run.sh` — it tries to resolve `run.sh` via PATH, and your working directory may not be in PATH.
- A multi-stage build copied binaries to an unexpected location.
Fix: check paths with --entrypoint sh + which, and use absolute paths in exec-form CMD/ENTRYPOINT.
2. permission denied (exit 126 or 1)
/docker-entrypoint.sh: line 1: exec: ./run: Permission denied
Causes:
- Script missing the exec bit. `chmod +x` it in the Dockerfile or in git.
- Mounted volume with wrong ownership — the container's user cannot access the file.
- SELinux / AppArmor blocking the exec.
Fix: ls -l /path/to/entry inside the container (via --entrypoint sh), check ownership and mode.
3. App crashes immediately after printing one line (exit 1)
Error: DATABASE_URL is not set
Causes:
- Env var missing from `-e` / `env_file` / Kubernetes `env`.
- `.env` file not loaded (Compose only auto-loads `.env` from the compose file's directory).
- Typo in the variable name.
Fix: docker inspect --format='{{.Config.Env}}' myapp to see actual env.
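A quick way to diff the variables the app needs against what the container actually received. The `REQUIRED` list and the sample dump below are invented for illustration; in practice the dump comes from that inspect command:

```shell
# Flag required variables that are absent (or misspelled) in an env dump
REQUIRED="DATABASE_URL NODE_ENV PORT"
ENV_DUMP="PATH=/usr/local/bin NODE_ENV=production DATABASEURL=postgres://db"  # note the typo
for var in $REQUIRED; do
  case " $ENV_DUMP " in
    *" $var="*) ;;                    # present, nothing to report
    *) echo "missing: $var" ;;
  esac
done
# missing: DATABASE_URL
# missing: PORT
```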
4. "Bind port already in use"
docker: Error response from daemon: ... bind for 0.0.0.0:8080 failed: port is already allocated
Causes:
- Another container already publishing that port.
- Host service already listening.
Fix: sudo ss -tlnp | grep :8080 to see what's on it. Choose a different host port or stop the other process.
5. Healthcheck failing silently
docker inspect --format='{{.State.Health.Status}}' myapp
# unhealthy
docker inspect --format='{{json .State.Health.Log}}' myapp | jq '.[-1]'
# {"Start":"...","End":"...","ExitCode":1,"Output":"curl: (7) Failed to connect to localhost port 8080: Connection refused"}
Causes:
- App listens where the check can't reach it. The healthcheck runs inside the container's network namespace, so a `127.0.0.1` bind is reachable; an IPv6-only bind probed with an IPv4 `curl`, or a bind to a specific interface, is not.
- Port mismatch (app listens 3000, healthcheck hits 8080).
- App is slow to start; `start_period` is too short.
Fix: run the healthcheck command manually inside the container. Adjust start_period.
6. Immediate OOMKilled at startup
docker inspect myapp --format='{{.State.OOMKilled}}'
# true
Causes:
- Memory limit too low for the runtime's minimum (e.g., 128 MB for a JVM).
- Runtime's heap allocation exceeds container's limit.
- Large allocation early in startup (big config load, warm-up routine).
Fix: raise limit, tune runtime (e.g., JAVA_OPTS='-Xmx256m' + container limit 512 MB).
7. Container starts fine locally but crashes in production
Different environments, different behaviors. Common culprits:
- Missing env vars in the production config.
- Permissions on bind-mounted paths different in production (host users differ).
- Image architecture mismatch — you built for amd64, deployed on arm64 (Graviton, Apple Silicon nodes). Symptom: exit 1 with "exec format error" in logs.
- Registry pull failure — wrong credentials, rate limit, network policy blocking.
A Full Debug Session in Practice
# 1. State
docker ps -a | grep myapp
# Exited (137) 2 minutes ago — running in a restart loop
# 2. Exit code + OOM
docker inspect myapp --format='{{.State.ExitCode}} {{.State.OOMKilled}}'
# 137 true → cgroup OOM kill
# 3. Logs (since the last restart)
docker logs --tail 50 myapp
# ... Starting nginx ...
# (nothing else — it got killed mid-startup)
# 4. dmesg for confirmation
sudo dmesg -T | tail -20 | grep -i oom
# Memory cgroup out of memory: Killed process 12345 (nginx) total-vm:500M anon-rss:450M ...
# 5. Check the limit
docker inspect myapp --format='{{.HostConfig.Memory}}'
# 134217728 → 128 MiB (too low for what nginx and its workers want)
# 6. Fix: raise the limit
docker update --memory=512m myapp
# 7. Restart cleanly
docker restart myapp
# 8. Verify
docker logs -f myapp
# ... running normally ...
# 9. Make it permanent (in compose / k8s)
10 minutes, root cause identified, fix applied, permanent remediation in flight.
Kubernetes-Specific Flow
# Step 1: pod state
kubectl get pod mypod -o wide
# NAME READY STATUS RESTARTS AGE
# mypod 0/1 CrashLoopBackOff 5 2m
# Step 2: describe
kubectl describe pod mypod | tail -40
# ... Events: OOMKilled, Last State: Terminated(137), ...
# Step 3: logs (and previous container's logs if crashing)
kubectl logs mypod
kubectl logs mypod -p # --previous
# Step 4: exec if the container is up long enough
kubectl exec -it mypod -- sh
# Step 5: ephemeral debug container (K8s 1.23+) sharing the app container's process namespace
kubectl debug mypod -it --image=busybox --target=app
# Step 6: same idea with a fuller toolbox image
kubectl debug -it mypod --image=nicolaka/netshoot --target=app
kubectl debug is the killer move in Kubernetes: inject a new container into a running pod that shares the target container's network and process namespaces, giving you a full shell without modifying the original image.
Key Concepts Summary
- Exit code is the first signal. 137 OOM / 139 segfault / 143 clean SIGTERM / 1 app error / 127 not-found.
- `docker logs` + `docker inspect` cover 80% of container-won't-start problems.
- `--entrypoint sh` is the escape hatch when the image crashes too fast to log.
- `docker events` shows the live lifecycle stream; great for catching restart loops.
- `dmesg` reveals kernel-level kills (OOM, segfault, permission).
- On Kubernetes, `kubectl describe pod` + `kubectl logs -p` are the equivalent first two tools.
- `kubectl debug` lets you drop a debugging container into a pod without changing its image.
- Distroless debugging: use the `:debug` tag or pull files out with `docker cp`.
- Host-side debugging via `/proc/<pid>/status` captures state when the container is too short-lived to `exec` into.
- Always ship a debug variant of your production image for fast diagnosis in prod.
Common Mistakes
- Looking at `docker logs` first for an empty container. Exit code comes first — empty logs + exit 127 = command not found, not a logging issue.
- Restarting the container and hoping it works. Without fixing the root cause, you'll be back in five minutes.
- Forgetting `kubectl logs -p` after a crash loop. The current container has no logs yet; the dead one does.
- Ignoring `OOMKilled: true` and blaming the app. The kernel killed it because the limit was too low. Fix limits or the runtime.
- Not using `docker events` when restarts are happening. It pinpoints exactly when each die/start occurs.
- Overriding `--entrypoint` without also removing the image's CMD arguments, which end up as positional args to `sh`.
- Using `docker exec` on a container that is not running — it fails with an unhelpful error. Run `docker ps` first to confirm the Up state.
- Running debug commands directly on a production container without a safety net. For high-risk actions, `docker commit` first so you have an image snapshot to fall back on.
- Relying on logs only. `docker inspect` often shows a misconfiguration the logs can never reveal (wrong mount, wrong env, wrong user).
- Forgetting platform mismatch. "Works on my M1 Mac, crashes in prod" often means an arm64-only image running on amd64 nodes.
A Kubernetes pod is in `CrashLoopBackOff`. `kubectl logs mypod` returns empty. `kubectl describe pod mypod` shows `Last State: Terminated, Reason: Error, Exit Code: 127`. What is the most likely cause, and what is your next step?