Resource Limits and Health Checks
A startup's API starts dying in production at 3 PM every weekday. Containers OOM, get restarted, and for about 10 minutes traffic is degraded. Monitoring shows plenty of memory on the hosts. Nothing in the app leaks obviously. The eventual discovery: one particular background job, enqueued when analytics teams run end-of-day reports, briefly spikes the API's memory to 4 GB. The containers had no
--memory limit set, so they kept consuming until the host ran out of memory and the kernel OOM killer started picking victims. It picked the wrong process (the API, not the background job), triggering everything else to fall over too. One-line fix: set --memory=2g on each API container. Now the analytics job OOMs its own container, gets auto-restarted, and the rest of the API is unaffected.

Resource limits and health checks are the pair of controls that turn containers from "processes that might crash the host" into "processes the orchestrator can detect and recover." This lesson covers how each works, how the kernel enforces them, what the common failure modes look like, and the HEALTHCHECK patterns that orchestrators actually use to decide whether to ship traffic to your container.
Why Limits Matter
Without limits, a container can consume all of:
- Host memory → kernel OOM killer picks victims (may not be the offender). Other containers on the host get killed.
- Host CPU → your process takes over a core; other processes starve. Noisy-neighbor problems.
- Host disk (via writable layer or volume) → fills the disk, breaks everything on the host.
- Host PIDs → fork bomb in one container kills the whole host (you literally cannot create new processes to recover).
- Host file descriptors → same story at a lower level.
The orchestrator cannot recover what it cannot detect. Limits give the kernel a threshold at which to kill an offender cleanly instead of letting it hurt neighbors.
A container without a memory limit is a container that can take down its host. Set --memory (or the Kubernetes equivalent) on every production container without exception. The specific value matters less than "some value"; start with a generous limit, observe usage, tighten. "Zero limit" is the one setting that is never right.
Memory Limits
# Docker
docker run -d --memory=512m --memory-swap=512m myapp
# Compose
# services:
# app:
# deploy:
# resources:
# limits: { memory: 512M }
# Kubernetes
# spec:
# containers:
# - name: app
# resources:
# limits:
# memory: 512Mi
Under the hood, this writes to memory.max in the container's cgroup (v2). When the container's total memory usage exceeds the limit, the kernel's cgroup OOM killer runs and picks a process in that cgroup to kill.
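You can see the written value directly. A sketch, assuming a cgroup v2 host; the exact scope path varies by runtime and container ID:

```shell
# 512m translates to bytes before being written to memory.max:
expected=$((512 * 1024 * 1024))
echo "$expected"   # 536870912

# On the host, the container's cgroup file holds exactly that value
# (path is illustrative; find yours under /sys/fs/cgroup/system.slice/):
# cat /sys/fs/cgroup/system.slice/docker-<container-id>.scope/memory.max
# 536870912
```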
What OOM looks like
docker run -d --name hungry --memory=200m alpine sh -c 'for i in $(seq 1 10); do dd if=/dev/zero of=/dev/shm/file$i bs=1M count=50; done; sleep 60'
# Watch it die
docker logs hungry
# 50+0 records in
# 50+0 records out
# ...
# dd: can't open '/dev/shm/file5': Cannot allocate memory
# (or the container is killed entirely — depends on how writes accumulate)
docker inspect hungry --format='{{.State.ExitCode}} {{.State.OOMKilled}}'
# 137 true
docker rm hungry
Exit code 137 = 128 + 9 = SIGKILL, and OOMKilled: true in the inspect output. This is the signature of a cgroup OOM kill.
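The 128 + signal convention is easy to reproduce without a container; the shell reports the same code for any SIGKILL-ed child:

```shell
# A child process that SIGKILLs itself; the shell reports 128 + 9 = 137,
# the same exit code Docker records after a cgroup OOM kill.
sh -c 'kill -KILL $$'
echo $?   # 137
```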
Requests vs limits (Kubernetes)
spec:
containers:
- name: app
resources:
requests:
memory: 256Mi
limits:
memory: 512Mi
- Request is what the scheduler uses to place the pod. The node must have this much free for the pod to be scheduled there.
- Limit is the hard cap at runtime.
Docker has only the limit; Kubernetes splits them, because scheduling across a cluster benefits from knowing each pod's baseline expected usage.
--memory-swap
docker run --memory=512m --memory-swap=512m myapp # no swap, hard 512m cap
docker run --memory=512m --memory-swap=1g myapp # up to 1g with swap; 512m RAM, 512m swap
docker run --memory=512m --memory-swap=-1 myapp # unlimited swap (avoid)
In production, the norm is --memory-swap == --memory (no swap) because swap-thrashing degrades performance in ways worse than OOM-kill. On Kubernetes, swap is historically off; modern clusters optionally allow it but with strong guardrails.
memory.high (soft limit, cgroup v2)
# Not exposed directly by Docker; Kubernetes can set memory.high via the MemoryQoS feature gate (alpha)
Below memory.high, normal operation. Between memory.high and memory.max, kernel actively tries to reclaim memory from the cgroup (slowing it down). Above memory.max, OOM. This soft-threshold behavior is how modern Linux keeps memory-hungry services from dying suddenly — they get throttled first.
A Java service set -Xmx to 4 GB on a container with no limit. The JVM's headroom (direct memory, metaspace, thread stacks) pushed total usage to 5.2 GB. The host eventually OOM-killed the JVM — but also three neighboring containers. Fix: set -Xmx3g and a container limit of 4 GB, giving the JVM headroom and a clear budget. The container-limit layer means a runaway JVM can only destroy itself, not its neighbors. This is the standard pattern: language heap limit < container limit, with container limit < node capacity.
CPU Limits
# 1.5 CPUs of total time
docker run -d --cpus=1.5 myapp
# Pin to specific cores (no time sharing)
docker run -d --cpuset-cpus="0,1" myapp
# Relative weight (under contention)
docker run -d --cpu-shares=512 myapp # default 1024; half priority
# Kubernetes
# spec:
# containers:
# - name: app
# resources:
# requests: { cpu: "500m" } # 0.5 CPU (500 millicpu)
# limits: { cpu: "2" } # 2 CPUs max
Under the hood, --cpus=1.5 writes cpu.max = "150000 100000" — the cgroup can use up to 150 ms of CPU time per 100 ms of wall clock. Exceeding this throttles (does not kill) the process.
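The conversion is plain arithmetic: quota = cpus × period, with a default period of 100000 µs. A quick check:

```shell
# --cpus=1.5 with the default 100000 microsecond period becomes a
# 150000 microsecond quota, i.e. cpu.max = "150000 100000".
period=100000
awk -v p="$period" 'BEGIN { printf "%d %d\n", 1.5 * p, p }'
# 150000 100000
```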
CPU Throttling: The Latency Killer
Unlike memory, CPU limits do not OOM. They throttle — the kernel pauses the process at the end of each CFS period if it exceeded its quota. This is invisible in application-level timers; it shows up as latency spikes.
# Inspect throttling
cat /sys/fs/cgroup/system.slice/docker-*.scope/cpu.stat
# usage_usec 12345678
# user_usec 10000000
# system_usec 2345678
# nr_periods 1023
# nr_throttled 48 # this cgroup was throttled 48 times
# throttled_usec 2400000 # total time throttled
nr_throttled climbing on a latency-sensitive service is a loud signal. The common production pattern in Kubernetes:
- Set CPU requests (for scheduling and guaranteed weight under contention)
- Do not set CPU limits on latency-sensitive services — let them burst freely.
This is controversial. The argument: CFS-quota throttling at 100ms granularity can cause significant p99 increases even when average CPU is well under the limit. Many production teams (including some large Kubernetes users) recommend request-only for latency-sensitive workloads.
For latency-sensitive services, CPU throttling metrics are more important than CPU usage. If nr_throttled is non-zero and p99 is up, the fix is either to raise or remove the CPU limit. CPU usage at 40% is fine; CPU throttled time at any significant fraction is not.
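A quick way to turn cpu.stat into an alertable number is the fraction of periods that hit the quota. A sketch using the sample values above (in production, point awk at the real file under /sys/fs/cgroup/):

```shell
# Compute the percentage of CFS periods in which this cgroup was throttled.
# Sample values copied from the cpu.stat output shown earlier.
cat > /tmp/cpu.stat <<'EOF'
nr_periods 1023
nr_throttled 48
throttled_usec 2400000
EOF
awk '/^nr_periods/ {p=$2} /^nr_throttled/ {t=$2} END { printf "%.1f%% of periods throttled\n", 100*t/p }' /tmp/cpu.stat
# 4.7% of periods throttled
```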
PIDs and Other Limits
# Cap PIDs (prevents fork bombs)
docker run -d --pids-limit=256 myapp
# Cap ulimits
docker run --ulimit nofile=2048:4096 myapp # file descriptors
docker run --ulimit nproc=256 myapp # processes
Most container workloads do not need many PIDs (10-50 is typical). Capping at 512 is almost always safe and prevents fork-bomb-style DoS.
Health Checks: Let the Orchestrator Help
A healthcheck is a command the container runtime runs periodically to determine if the container is working. If it fails enough times, the container is marked unhealthy — and the orchestrator can act (stop sending traffic, restart, reschedule).
Dockerfile-level HEALTHCHECK
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
CMD wget --spider -q http://localhost:8080/health || exit 1
Parameters:
- --interval — how often to run (default 30s).
- --timeout — max time the healthcheck command may take (default 30s).
- --start-period — grace period after container start; failures here do not count toward retries.
- --retries — how many consecutive failures mark the container unhealthy (default 3).
Runtime-level (docker run --health-*)
Overrides the image's healthcheck:
docker run -d --name api \
--health-cmd='curl -f http://localhost:8080/health || exit 1' \
--health-interval=30s \
--health-timeout=3s \
--health-retries=3 \
myapp
Compose
services:
api:
image: myapp
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
interval: 30s
timeout: 3s
retries: 3
start_period: 10s
Kubernetes (three kinds of probes)
Kubernetes is more nuanced — it has three probe types, each with a distinct role:
spec:
containers:
- name: app
image: myapp
startupProbe: # runs until it passes once; then disabled
httpGet: { path: /health, port: 8080 }
periodSeconds: 2
failureThreshold: 30 # up to 60s for slow starts
livenessProbe: # if fails, kubelet restarts the container
httpGet: { path: /health, port: 8080 }
periodSeconds: 30
failureThreshold: 3
readinessProbe: # if fails, pod is removed from Service endpoints
httpGet: { path: /ready, port: 8080 }
periodSeconds: 5
failureThreshold: 2
| Probe | What happens on failure | Use for |
|---|---|---|
| startupProbe | Nothing while running; gates liveness/readiness from starting | Long-starting apps (Java, big ML init) |
| livenessProbe | kubelet restarts the container | Detecting deadlock / hang |
| readinessProbe | Pod removed from Service endpoints (no traffic) | Graceful shutdown, temporary unready states |
The Kubernetes three-probe model is the right mental model even for plain Docker. Separate "is the container alive enough to take traffic" (readiness) from "is the container so broken it needs a restart" (liveness). Getting this right reduces unnecessary restarts and stops traffic from hitting broken instances.
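In plain Docker Compose, the health status is what other services gate on; the service_healthy condition plays the readiness role. A sketch (the myapp image name is illustrative):

```yaml
services:
  db:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 3s
      retries: 5
  api:
    image: myapp          # illustrative image name
    depends_on:
      db:
        condition: service_healthy   # api starts only after db reports healthy
```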
Designing a good /health endpoint
Three levels of depth:
- Liveness — "is the process responsive?" Return 200 if the HTTP handler runs at all; no dependency checks. A failing liveness check means the process itself is hung.
- Readiness — "can I serve traffic?" Check database connection, cache connection, critical config. 200 if all deps OK.
- Startup — "have I finished initializing?" Only returns 200 once warmup is done.
Avoid these common mistakes:
- Calling all downstream services from liveness. If the DB is down, every container becomes "unhealthy" and kubelet restarts them all, turning a DB outage into a compound incident.
- Returning 200 with a body like {"status":"error"}. Container runtimes check the HTTP status code only.
- Slow healthchecks. Target < 100 ms. A healthcheck that takes 3 seconds will sometimes time out legitimately, causing false-positive failures.
A production HEALTHCHECK pattern
# Install a tiny check binary in the image
# curl and wget work; for distroless images, use HEALTHCHECK NONE and rely on the orchestrator
FROM node:20-slim
# ... build ...
HEALTHCHECK --interval=15s --timeout=3s --start-period=30s --retries=3 \
CMD node -e "require('http').get('http://localhost:8080/health', r => process.exit(r.statusCode===200?0:1)).on('error', () => process.exit(1))"
Or for Go / distroless:
FROM gcr.io/distroless/static-debian12
COPY --from=build /out/app /app
COPY --from=build /out/healthcheck /healthcheck
HEALTHCHECK --interval=15s --timeout=3s \
CMD ["/healthcheck"]
/healthcheck is a tiny Go binary that opens localhost:8080/health and returns 0 or 1 based on the status code. Avoids bundling curl / wget in distroless.
Watching Limits Work
# Live resource usage per container
docker stats
# CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
# myapp-1 52.3% 412MiB / 512MiB 80% ... ... 9
# One-shot
docker stats --no-stream myapp-1
# Inspect after an incident
docker inspect myapp-1 --format='Exit={{.State.ExitCode}}, OOM={{.State.OOMKilled}}, Err={{.State.Error}}'
# Exit=137, OOM=true, Err=
# Healthcheck history
docker inspect myapp-1 --format='{{json .State.Health}}' | jq
# {
# "Status": "healthy",
# "FailingStreak": 0,
# "Log": [
# {"Start":"2026-04-20T10:00:00Z","End":"...","ExitCode":0,"Output":""},
# ...
# ]
# }
In Kubernetes:
kubectl describe pod mypod
# ...
# State: Terminated
# Reason: OOMKilled
# Exit Code: 137
# Last State: Running
# Events:
# ... FailedScheduling: 0/10 nodes are available: 8 Insufficient memory, 2 Insufficient cpu.
# ... Unhealthy: Liveness probe failed: HTTP probe failed with status code: 500
What Not to Do
Setting unrealistic limits
resources:
limits:
memory: "32Ki" # 32 kibibytes, not mebibytes. A typo waiting to happen.
Typos in limits are embarrassing: the container starts, gets OOM-killed instantly, restarts, and repeats forever. Always test limits against a realistic workload.
Using a healthcheck that restarts healthy containers
If your livenessProbe hits /health which calls your database, and your database has a 10-second hiccup, every container restarts. Your fleet thrashes through the outage instead of just weathering it.
Rule: liveness checks should not depend on external services. Readiness checks can. This distinction is the difference between "restart on internal deadlock" (correct) and "restart the whole fleet whenever anything is slow" (disastrous).
Omitting start_period for slow-starting apps
Java services can take 30-60 seconds to warm up. Docker's default start-period is 0 seconds, and Kubernetes probes begin immediately unless you configure a startupProbe or initialDelaySeconds. If the probe fires before the JVM is ready, the container is marked unhealthy and restarted — and the cycle repeats forever.
# Java Spring Boot app
startupProbe:
httpGet: { path: /actuator/health, port: 8080 }
periodSeconds: 5
failureThreshold: 24 # 2 minutes of grace
livenessProbe:
httpGet: { path: /actuator/health/liveness, port: 8080 }
periodSeconds: 30
readinessProbe:
httpGet: { path: /actuator/health/readiness, port: 8080 }
periodSeconds: 5
The startupProbe gets 2 minutes to pass once; only then do liveness and readiness start probing.
Setting no resource requests on Kubernetes
# Limits but no requests: Kubernetes silently copies the limits into the
# requests. With neither set, the scheduler assumes zero, crams the pod
# anywhere, and eviction under pressure is a coin flip.
resources:
limits:
memory: 512Mi
Always set requests explicitly. Relying on the defaults creates unpredictable scheduling and QoS behavior.
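A corrected version of the fragment above, with explicit requests (values are illustrative; size them from observed usage):

```yaml
resources:
  requests:
    memory: 256Mi   # what the scheduler reserves on the node
    cpu: 250m
  limits:
    memory: 512Mi   # hard cap; exceeding it means an OOM kill
```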
Key Concepts Summary
- Every production container needs a memory limit. Without one, any container can take down its host.
- Memory limits OOM-kill; CPU limits throttle. Different failure modes; watch both.
- Set the container memory limit above the runtime heap limit. JVM -Xmx3g with a container limit of 4g leaves headroom.
- CPU throttling metrics (nr_throttled) matter more than CPU usage for latency-sensitive services. Many teams run request-only for those.
- Three probe types (K8s) or HEALTHCHECK (Docker). Liveness = "is it alive?"; Readiness = "should I send traffic?"; Startup = "has init finished?"
- Liveness probes should not depend on external services. Otherwise a DB hiccup causes fleet-wide restarts.
- Slow-starting apps need startupProbe or a long start_period. Otherwise infinite restart loops.
- Healthcheck endpoints should be sub-100 ms. HTTP status code only; no body parsing.
- docker stats, docker inspect, and kubectl describe are the tools for watching limits and probes work in practice.
Common Mistakes
- No memory limit in production. One bad deploy takes the host down.
- Setting a CPU limit on a latency-sensitive service and puzzling over why p99 is bad. Check nr_throttled in cpu.stat.
- Liveness probe hits DB: a DB burp → full fleet restart.
- Forgetting start_period on slow-starting apps. Restart loops during deploys.
- Mixing up memory suffixes. In Kubernetes, 512M is 512 000 000 bytes and 512Mi is 512 × 2²⁰ bytes; Docker's 512m is binary (MiB). Know which suffix your platform expects.
- Kubernetes: setting limits but no requests. Requests silently default to the limits, which can over-reserve; setting neither makes the pod BestEffort and first in line for eviction.
- Healthcheck that returns 200 with error JSON. Runtimes check status code only.
- Using HEALTHCHECK NONE and no orchestrator probes. Literally nothing knows whether your container is working.
- Oversubscribing CPU: two containers on one 4-core host, each with --cpus=4. The limits overlap; under load both contend and throttle.
- Swap enabled with --memory-swap=-1. The container can swap indefinitely; performance degrades to a crawl instead of dying cleanly.
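The suffix confusion in the list above is easy to verify with shell arithmetic:

```shell
# Kubernetes: M is decimal, Mi is binary. The difference is about 4.9%.
echo $((512 * 1000 * 1000))   # 512M  = 512000000 bytes
echo $((512 * 1024 * 1024))   # 512Mi = 536870912 bytes
```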
Your Kubernetes pod has `livenessProbe` that hits `/health`, which internally checks the database connection. The database has a 20-second outage. kubectl describe shows every pod restarted during that window. Traffic dropped much longer than the DB outage itself. What went wrong, and what's the fix?