
Your Liveness Probe Is Killing Your Pod Mid-Boot. The Six Probe Mistakes That Cause Real Outages.

Liveness probes that fire before your app is ready. Readiness probes that check the database. Exec probes leaking zombie processes by the thousands. The six mistakes that turn health checks into the cause of the outage they were supposed to prevent.

By Sharon Sahadevan · 13 min read

A new version of payments-api rolls out at 2 PM. Every replica enters a CrashLoopBackOff. Logs show the app starting, partially initializing, then being killed mid-boot by SIGTERM with no error. Five minutes later it tries again. Same thing.

You check the events:

Warning  Unhealthy  16s    kubelet  Liveness probe failed: HTTP 503
Warning  Killing    12s    kubelet  Container payments-api failed liveness probe, will be restarted

Your liveness probe is killing the pod before it finishes loading the cache, opening database connections, and warming up. The probe sees a 503, decides "unhealthy," restarts. The restart hits the same wall. Crash loop.

Three years ago you copied the probe config from the deployment template repository. Nobody has touched it since. Today it does not match how long this service takes to start. Every pod is dying in the same place.

This is one of six probe configuration mistakes that turn health checks into the cause of outages. This post is the catalog: what each probe actually does, the six wrong-by-default patterns, and the right configuration per service type.

What the three probes actually do#

Kubernetes pods can have three independent probes, and the difference matters more than most engineers realize.

Liveness probe: "is the container still alive enough to keep running?" Failures cause the kubelet to restart the container. Designed to recover from deadlocks, hung loops, or unrecoverable internal state.

Readiness probe: "is the container ready to receive traffic?" Failures cause the kubelet to remove the pod's IP from Service endpoints. Pod stays running. Designed for "warming up" and "temporarily overloaded."

Startup probe: "is the container done starting up?" Failures cause restart, like liveness. But while the startup probe runs, liveness and readiness probes are suspended. Designed for slow-booting apps to get a longer initial grace period.

The relationship between them: startup runs first; once it succeeds, liveness and readiness take over. If startup never succeeds within failureThreshold * periodSeconds, the container is killed and restarted to try again.

This three-probe model exists because the failure modes are genuinely different. Mixing them up is mistake #1.

Mistake 1: Using liveness as an "is the app ready" check#

A pod's liveness probe checks GET / and gets a 503 during startup. The kubelet kills the pod. Crash loop.

The fix is not to "make liveness more lenient" (people often raise initialDelaySeconds to 60, then 120, then 300). The fix is to use a startup probe for the warming-up window:

# WRONG: liveness has to handle slow start
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 120   # workaround for slow boot
  periodSeconds: 10
  failureThreshold: 3

# RIGHT: startup probe handles slow start, liveness is fast
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
  failureThreshold: 60       # 60 * 5s = 5 minutes max startup
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
  # no initialDelaySeconds needed; startup probe handles boot

After the startup probe succeeds, the liveness probe runs at full speed (10s period). If startup fails to complete within 5 minutes, the container is restarted.

The initialDelaySeconds workaround is fragile: too short and you crash; too long and you hide bugs (a deadlocked process won't be killed for 5 minutes). The startup probe fixes both.

Mistake 2: Liveness probe checks dependencies#

Your liveness probe hits an endpoint that connects to the database to verify "the app can serve traffic." The database has a brief hiccup (failover, network partition, connection pool exhaustion). The probe fails. The kubelet kills the app. Now you have a thundering herd of restarting pods all trying to reconnect to the recovering database, making the recovery slower.

Liveness should only check if the process itself is alive. Not its dependencies. The right kind of check:

  • A handler that returns 200 if the main event loop is responsive (unblocked, not deadlocked).
  • A simple in-process counter that has incremented since the last probe.
  • A heartbeat that the main work loop updates.

Not:

  • A query against the database.
  • A call to a downstream microservice.
  • A check that consumes a Kafka message.

If the database is down, the right behavior is to keep the app running, return errors to clients (or queue work, depending on the service), and recover when the database recovers. Killing the app makes things worse.

Readiness probes are the right place for dependency checks (because failure removes the pod from Service endpoints temporarily, instead of restarting the pod):

livenessProbe:
  httpGet:
    path: /healthz/live      # process-only check
    port: 8080

readinessProbe:
  httpGet:
    path: /healthz/ready     # checks DB, downstream services, etc.
    port: 8080

Two endpoints, two purposes. Most modern frameworks have separate hooks for these.
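
To make the split concrete, here is a minimal sketch of the two handlers in Go. The paths match the manifest above; the server setup and the pingDatabase placeholder are illustrative assumptions, not a specific framework's API:

package main

import (
	"context"
	"errors"
	"log"
	"net/http"
	"time"
)

// pingDatabase is a stand-in for a real dependency check (for example db.PingContext).
func pingDatabase(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return errors.New("dependency check timed out")
	default:
		return nil // pretend the dependency is healthy
	}
}

func main() {
	mux := http.NewServeMux()

	// Liveness: process-only. If this handler runs, the HTTP server and its
	// event loop are responsive; no dependency is consulted.
	mux.HandleFunc("/healthz/live", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: consults critical dependencies. A failure returns 503, which
	// removes the pod from Service endpoints without restarting it.
	mux.HandleFunc("/healthz/ready", func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		if err := pingDatabase(ctx); err != nil {
			http.Error(w, err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	log.Fatal(http.ListenAndServe(":8080", mux))
}

The only rule that matters is the asymmetry: the live handler touches nothing outside the process; the ready handler is allowed to fail when a dependency does.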

Mistake 3: Exec probes leaking zombie processes#

The most insidious probe bug. An exec probe runs a command inside the container:

livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - 'curl -fs http://localhost:8080/healthz'
  periodSeconds: 10

Every 10 seconds, the kubelet execs a new shell inside the container to run curl. If the probe command or any child it spawns outlives the probe (a curl that hangs past the timeout, a forked helper), the orphan is reparented to the container's PID 1: the main app. When the orphan exits, the app has to reap it with wait(); if it never does, each one lingers as a zombie.

After a few hours, the container has thousands of zombie processes. Each one still occupies a slot against the cgroup's PID limit (pids.max), and once that limit is hit the container can no longer fork: new exec probes fail (restarting the container), and the app itself can no longer spawn threads or child processes.
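
A quick way to confirm the accumulation (this assumes the image carries a procps-style ps; busybox's ps lacks these flags):

# count processes in zombie (Z) state inside the container
kubectl exec $POD -- ps -eo pid,stat,comm | awk '$2 ~ /Z/' | wc -l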

The fix:

Option A: don't use exec probes. HTTP and TCP probes don't fork inside the container; the kubelet does the network call from outside. Most apps can expose an HTTP health endpoint instead.

Option B: use a proper init system as PID 1 that reaps zombies. tini is the canonical choice:

RUN apt-get update && apt-get install -y tini
ENTRYPOINT ["/usr/bin/tini", "--"]
CMD ["/usr/local/bin/myapp"]

Or for distroless:

# assumes an earlier build stage (or named image) called "tini" that provides the static binary
COPY --from=tini /tini /tini
ENTRYPOINT ["/tini", "--"]
CMD ["/myapp"]

tini reaps zombies. With it, exec probes are safe.

Option C: set shareProcessNamespace: true on the pod spec, which makes the pause container PID 1 for all containers in the pod (and pause does reap zombies). Less common, but it works.
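
A minimal sketch of option C; the field names are the real pod-spec fields, while the container name and image are placeholders:

spec:
  shareProcessNamespace: true    # pause becomes PID 1 for every container and reaps zombies
  containers:
    - name: app
      image: registry.example.com/payments-api:v1   # placeholder image
      livenessProbe:
        exec:
          command: ["/bin/sh", "-c", "curl -fs http://localhost:8080/healthz"]
        periodSeconds: 10

One trade-off worth noting: with a shared namespace, processes are visible across all containers in the pod, and the app is no longer PID 1, which matters if it treats PID 1 specially for signal handling.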

Mistake 4: Tight intervals plus long timeout equals thundering kill#

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 1           # every 1 second
  timeoutSeconds: 30         # wait 30s before declaring failure
  failureThreshold: 1        # one failure = kill

Looks aggressive but careful. In practice the two numbers fight each other: the kubelet runs one probe at a time per container, so a probe against a hung endpoint blocks for the full 30-second timeout and the advertised 1-second cadence never happens. Detecting a hang takes 30 seconds regardless of the period, and with failureThreshold: 1 a single slow response, one GC pause or transient stall, is enough to get the pod killed.

The right ratio: periodSeconds > timeoutSeconds, ideally with failureThreshold >= 3 so a transient blip doesn't kill the pod.

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3
  # probes never overlap (3s timeout < 10s period); ~30s of consecutive failures before a kill

Single transient slow response: probe times out, fails 1, next probe succeeds, counter resets. Persistent failure: 3 consecutive timeouts (~30 seconds) before kill. Right amount of patience.

Mistake 5: Readiness check that ignores graceful shutdown#

When a pod is being terminated, Kubernetes sends SIGTERM, waits terminationGracePeriodSeconds (default 30s), then SIGKILL. The point of the grace period: existing requests finish, new requests stop coming.

But "new requests stop coming" only happens if the readiness probe starts failing during termination, which removes the pod from Service endpoints. Many apps don't fail their readiness probe during shutdown; they keep saying "ready" until they get killed. Result: kube-proxy still routes new requests to a dying pod.

The right pattern: a preStop hook that flips an internal flag, makes readiness fail, then waits long enough for kube-proxy to update endpoints (typically 5-15 seconds):

spec:
  containers:
    - name: app
      lifecycle:
        preStop:
          exec:
            command:
              - /bin/sh
              - -c
              - 'curl -X POST http://localhost:8080/admin/shutdown && sleep 15'
      readinessProbe:
        httpGet:
          path: /healthz/ready
          port: 8080
        # ready endpoint should return 503 once shutdown started
  terminationGracePeriodSeconds: 30

The application's /admin/shutdown flips a flag; /healthz/ready checks the flag and returns 503 after the flip. The 15-second sleep gives kube-proxy time to remove the pod from endpoints. Then SIGTERM arrives and the app shuts down cleanly.

Without this, every deploy briefly serves errors: new requests keep landing on terminating pods, and in-flight connections get cut.
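
A minimal application-side sketch of that flag in Go; the /admin/shutdown and /healthz/ready paths mirror the manifest above, everything else is illustrative:

package main

import (
	"log"
	"net/http"
	"sync/atomic"
)

// shuttingDown is flipped by the preStop hook via POST /admin/shutdown.
var shuttingDown atomic.Bool

func main() {
	mux := http.NewServeMux()

	// The preStop hook calls this; from here on, readiness reports 503.
	mux.HandleFunc("/admin/shutdown", func(w http.ResponseWriter, r *http.Request) {
		shuttingDown.Store(true)
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: 503 once shutdown has started, so the pod drops out of
	// endpoints while the preStop sleep gives kube-proxy time to catch up.
	mux.HandleFunc("/healthz/ready", func(w http.ResponseWriter, r *http.Request) {
		if shuttingDown.Load() {
			http.Error(w, "shutting down", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	// Liveness keeps returning 200 until the process actually exits.
	mux.HandleFunc("/healthz/live", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	log.Fatal(http.ListenAndServe(":8080", mux))
}

The real service would also handle SIGTERM itself (for example with http.Server.Shutdown in Go) to drain in-flight requests; the flag only covers the "stop sending me new traffic" half.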

Mistake 6: One endpoint for all three probes#

startupProbe:
  httpGet:
    path: /health
    port: 8080
livenessProbe:
  httpGet:
    path: /health
    port: 8080
readinessProbe:
  httpGet:
    path: /health
    port: 8080

The problem: the three probes have completely different semantics, and no single endpoint can represent all of them.

  • During startup, the endpoint should return 503 until init is done. (Startup wants this; readiness wants this; liveness must NOT see this as a failure.)
  • During normal operation, the endpoint should return 200 if the process is alive (liveness) AND ready for traffic (readiness).
  • During degradation (DB down), the endpoint should return 503 for readiness but 200 for liveness.
  • During shutdown, the endpoint should return 503 for readiness but 200 for liveness, until SIGKILL.

A single endpoint cannot serve all of these. Some teams paper over this by mapping it to "is the process alive" (always 200) and skipping readiness entirely; others map it to readiness and accept that liveness will sometimes restart pods just because the database is slow.

The right answer: three separate endpoints, each implementing its specific check.

startupProbe:
  httpGet:
    path: /healthz/started   # 200 once init done
    port: 8080
livenessProbe:
  httpGet:
    path: /healthz/live      # 200 if process responsive
    port: 8080
readinessProbe:
  httpGet:
    path: /healthz/ready     # 200 if ready to serve (init done, deps OK, not shutting down)
    port: 8080

Modern frameworks (Spring Boot Actuator, ASP.NET Core HealthChecks, FastAPI's add_api_route, etc.) make these trivial to implement. Most have built-in support for the trio.

The right pattern per service type#

Different service classes need different probe configs. Three common patterns.

Stateless HTTP service (typical microservice)#

startupProbe:
  httpGet:
    path: /healthz/started
    port: 8080
  periodSeconds: 5
  failureThreshold: 30        # 150s max startup

livenessProbe:
  httpGet:
    path: /healthz/live       # process-only check, no deps
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /healthz/ready      # process + critical deps
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 2

Database / stateful workload#

Liveness should be very forgiving. A database catching up on replication, vacuuming, or recovering should not be restarted casually.

startupProbe:
  exec:
    command: ["/usr/local/bin/pg_isready", "-U", "postgres"]
  periodSeconds: 10
  failureThreshold: 60        # up to 10 minutes for slow recovery

livenessProbe:
  exec:
    command: ["/usr/local/bin/pg_isready", "-U", "postgres"]
  periodSeconds: 30           # slower than HTTP; database is heavier
  timeoutSeconds: 10
  failureThreshold: 5         # 2.5 minutes of failures before restart

readinessProbe:
  exec:
    command: ["/usr/local/bin/pg_isready", "-U", "postgres", "-h", "localhost"]
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3

(Use tini or process-namespace-sharing to handle the exec zombies.)

Queue worker (no inbound traffic)#

No readiness probe (no Service in front). Liveness is based on a heartbeat the worker emits.

livenessProbe:
  httpGet:
    path: /healthz/live       # worker exposes /healthz/live on a small HTTP server
    port: 8081                # admin port, separate from work
  periodSeconds: 30
  failureThreshold: 3
# no readinessProbe

Worker updates a "last loop iteration" timestamp; the live endpoint returns 200 if the timestamp is fresh (within last 60s), 503 otherwise. Catches stuck workers without requiring a Service.
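
A sketch of that heartbeat in Go; the 60-second freshness window and port 8081 match the config above, and the loop body is a placeholder:

package main

import (
	"log"
	"net/http"
	"sync/atomic"
	"time"
)

// lastLoop holds the Unix timestamp of the most recent completed work-loop iteration.
var lastLoop atomic.Int64

func workLoop() {
	for {
		// ... pull a message, process it ... (placeholder)
		lastLoop.Store(time.Now().Unix())
		time.Sleep(time.Second)
	}
}

func main() {
	lastLoop.Store(time.Now().Unix())
	go workLoop()

	// Liveness on the admin port: 200 if the loop has completed an iteration
	// in the last 60 seconds, 503 if it appears stuck.
	http.HandleFunc("/healthz/live", func(w http.ResponseWriter, r *http.Request) {
		if time.Since(time.Unix(lastLoop.Load(), 0)) > 60*time.Second {
			http.Error(w, "work loop stalled", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":8081", nil))
}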

Diagnosing probe issues in production#

When pods are crashlooping or failing readiness:

# Step 1: see what's failing
kubectl describe pod $POD | grep -A 5 -E "Liveness|Readiness|Startup"

# Step 2: look at recent events
kubectl get events -n $NAMESPACE --field-selector involvedObject.name=$POD --sort-by='.lastTimestamp'

# Step 3: manually run the probe to see what it returns
kubectl exec -it $POD -- curl -v http://localhost:8080/healthz/ready
# Or for exec probes:
kubectl exec -it $POD -- /the/command/the/probe/runs

# Step 4: check the kubelet's view of the probe (the kubelet is a node
# daemon, not a pod, so read its journal on the node running the pod)
journalctl -u kubelet | grep $POD

The kubelet's logs show the actual HTTP status, response time, and any errors from probe execution. Often the truth is "I sent the probe and got connection refused for 30 seconds" while the app logs say "running fine," and the divergence tells you the probe is hitting the wrong port or path.
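
If the image is distroless (no shell, no curl), step 3 still works through an ephemeral debug container; curlimages/curl here is just one convenient image choice:

# the ephemeral container shares the pod's network namespace,
# so localhost:8080 reaches the app even when its own image has no curl
kubectl debug -it $POD --image=curlimages/curl -- curl -v http://localhost:8080/healthz/ready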

Quick reference: the probe configuration checklist#

For every Deployment, verify:

1. Three probes, three endpoints (or three commands), three purposes.
   - startup: /healthz/started, suspends others until done
   - liveness: /healthz/live, process-only, never checks deps
   - readiness: /healthz/ready, includes deps + shutdown signaling

2. Slow-starting apps use startupProbe, not initialDelaySeconds on liveness.

3. Liveness does NOT check database, downstream services, or
   anything that can fail independently of the process.

4. periodSeconds > timeoutSeconds, with failureThreshold >= 3.

5. Exec probes are either replaced with HTTP/TCP probes or run with tini as PID 1.

6. preStop hook flips readiness to fail before SIGTERM, with a 15s sleep
   so kube-proxy updates endpoints first.

7. terminationGracePeriodSeconds covers preStop sleep + actual shutdown
   time (typically 30-60s).

8. Manual probe test passes: kubectl exec -- curl /healthz/...

What to monitor#

Prometheus alerts that catch probe-related issues:

# Pods restarting due to liveness failures
increase(kube_pod_container_status_restarts_total[15m]) > 3

# Pods stuck not-ready (readiness failing)
kube_pod_status_ready{condition="false"} == 1
  and on(pod, namespace) kube_pod_status_phase{phase="Running"} == 1   # exclude Succeeded/Failed pods

# Probe failure rate from the kubelet
rate(prober_probe_total{result="failed"}[5m]) > 0.1

The third metric (kubelet probe failures) is the leading indicator. If it goes nonzero in production, something is wrong. Most teams don't have it on a dashboard. Add it.
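
If you run the Prometheus Operator, turning that expression into an alert is a few lines of YAML (assuming its PrometheusRule CRD; the group and alert names are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: probe-failures
spec:
  groups:
    - name: probes
      rules:
        - alert: ProbeFailuresDetected
          expr: 'rate(prober_probe_total{result="failed"}[5m]) > 0.1'
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Probes failing for {{ $labels.pod }} ({{ $labels.probe_type }})"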

The mental model#

Probes are not optional and they are not interchangeable. Each one defends against a specific failure mode:

  • Liveness: defends against deadlocks and hung processes inside the container.
  • Readiness: defends against routing traffic to a pod that cannot serve it.
  • Startup: defends against killing slow-booting pods before they get a chance.

Most production probe bugs come from misunderstanding which defends against what. Liveness gets used to check readiness things; readiness gets used to check liveness things; startup is omitted entirely.

Once you separate the three concerns and implement three distinct endpoints (or commands), almost every probe-related outage stops happening. The ten lines of config are some of the highest-leverage configuration in any pod spec.


Probes are one of the foundational topics in the Kubernetes Debugging course, where we cover the kubelet's full pod lifecycle, including how it interacts with the container runtime, sandbox, and CNI. The production patterns (rollout safety, graceful shutdown, and probe SLO design) are part of the Production Kubernetes Operations course.
