Container Is Slow
A customer complains that the API feels laggy. The dashboards show the containers at 40% CPU, 60% memory — nothing dramatic. No errors in logs.
docker stats paints a calm picture. But p99 latency has quietly doubled over the last week, nobody knows why, and the "fix" the team keeps trying is scaling up. It does not help. Eventually an engineer notices nr_throttled in the cpu.stat file climbing 10x per minute. The container has a CPU limit of --cpus=1; the workload bursts past 1 CPU for hundreds of milliseconds at a time and gets throttled, adding 50-200 ms of queueing latency per request. The fix: remove the CPU limit, keep the request. Latency returns to baseline immediately.
"The container is slow" is the most information-starved symptom you can get. The tools to diagnose it are all there — docker stats, the cgroup cpu.stat and memory.events files, host-level tools like vmstat and iostat, perf sampling — but knowing which one to pull out first is the skill. This lesson is the slow-container flowchart: what to check in what order, how to tell "the container is constrained" from "the host is overloaded" from "the app is just slow," and how to instrument production to catch each class early.
The Categories of Slow
Before debugging, classify. "Slow" always falls into one of:
- CPU-bound — work is pegged on one or more cores inside the container. May or may not be throttled by a CPU limit.
- Memory-bound — heavy swapping, page cache churn, or GC pressure. Memory limit may be imposing throttling.
- I/O-bound — disk or network calls are slow. Either the storage layer is genuinely slow or the container is I/O-rate-limited.
- Scheduler / throttling — the kernel is pausing the container at the cgroup level. This is the most invisible class and the most common.
- Contention with neighbors — the host is overloaded; your container is fine but cannot get the resources.
- Application logic — slow external API, inefficient query, bad algorithm. The container is not the problem.
The debugging tools are different for each category. The flowchart:
Is the container slow?
1. docker stats (live)
└── CPU %? Memory %? Network I/O? Disk I/O?
2. Check for cgroup throttling
└── cat /sys/fs/cgroup/.../cpu.stat — nr_throttled
└── cat /sys/fs/cgroup/.../memory.events — high + max events
3. Check the host
└── vmstat 1 — is r column (run queue) long?
└── iostat -xz 1 — is await rising?
└── top → iowait, us, sy
4. Check the app
└── Enter the container, profile (perf, py-spy, flame graphs)
5. Check downstream
└── Is the DB slow? External API rate-limited?
Step 1: docker stats — The 15-Second Snapshot
docker stats
# CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
# abc123... myapp 183% 780MiB / 1GiB 76% 1.2GB / 900MB 120MB / 80MB 24
Quick reads:
- CPU % above 100%? Normal with no limit: docker stats reports 100% per core, so a container showing 200% is using ~2 cores. A cpuset or CPU quota caps how high this can go.
- CPU % near the limit? E.g., the limit is --cpus=1 and you see 99%. Suspect throttling.
- MEM USAGE near LIMIT? Getting close to memory.max means the kernel is reclaiming aggressively; the container is about to OOM.
- Net I/O or Block I/O mismatched against traffic? If traffic is steady and NET I/O spiked, something else is moving data.
docker stats shows totals from the cgroup files. It is a summary; it does not reveal throttling.
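Under the hood, that CPU % is just the derivative of the cgroup's usage_usec counter. A minimal sketch of the same arithmetic (the cpu_percent helper is ours, not a Docker tool):

```shell
# Approximate docker stats' CPU % from two cgroup usage_usec readings.
# Args: usage_usec at t0, usage_usec at t1, elapsed wall time in usec.
cpu_percent() {
  awk -v a="$1" -v b="$2" -v dt="$3" \
    'BEGIN { printf "%.0f%%\n", 100 * (b - a) / dt }'
}

# Example: 1.83 CPU-seconds consumed in 1 wall-second of elapsed time
cpu_percent 0 1830000 1000000   # → 183%
```

Sampling usage_usec yourself is handy when docker stats is unavailable (e.g., inside a minimal monitoring agent).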
Step 2: Check for Throttling (The Invisible Killer)
# Find the container's cgroup
docker inspect myapp --format='{{.Id}}'
# abc123...
CGROUP=/sys/fs/cgroup/system.slice/docker-abc123....scope
# CPU throttle stats (cgroup v2)
cat $CGROUP/cpu.stat
# usage_usec 12345678
# user_usec 10234567
# system_usec 2111111
# nr_periods 1823
# nr_throttled 42 ← cgroup was throttled 42 times
# throttled_usec 1234567 ← total 1.2 seconds paused
# Watch over time
watch -n 1 "cat $CGROUP/cpu.stat"
If nr_throttled is climbing, the container is hitting its CPU limit. This is the classic "container is slow but CPU looks fine" pattern: average CPU use is moderate, but within each 100 ms CFS period, short bursts exceed the quota and get throttled.
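The three counters together tell you how painful the throttling is. A sketch, assuming cgroup v2 and a helper name of our own choosing, that turns a cpu.stat snapshot into a throttled fraction and an average pause length:

```shell
# Summarize cpu.stat: what fraction of CFS periods ended in throttling,
# and how long the average throttle pause was.
throttle_summary() {
  awk '
    /^nr_periods /    { periods = $2 }
    /^nr_throttled /  { throttled = $2 }
    /^throttled_usec/ { usec = $2 }
    END {
      if (periods > 0)
        printf "throttled %.1f%% of periods", 100 * throttled / periods
      if (throttled > 0)
        printf ", avg pause %.1f ms", usec / throttled / 1000
      printf "\n"
    }' "$1"
}

# throttle_summary "$CGROUP/cpu.stat"
```

With the example numbers above (1823 periods, 42 throttled, 1234567 usec paused), this reports roughly a 2% throttle rate with ~29 ms pauses, each pause landing directly on some request's latency.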
Fix options:
- Raise the limit. If the workload legitimately needs more CPU, give it more.
- Remove the CPU limit. For latency-sensitive services, this is often the right call. Keep the CPU request for scheduling and weight purposes; drop the hard limit.
- Tune the CFS period. A longer period lets bursts run longer before the quota bites, but the throttle pauses get longer too. Rarely worth it.
# Memory throttling (cgroup v2)
cat $CGROUP/memory.events
# low 0
# high 823 ← crossed the soft limit 823 times (active reclaim)
# max 3 ← hit the hard limit 3 times (reclaim could not keep usage under it)
# oom 3
# oom_kill 5 ← kernel killed 5 processes
# Memory pressure stall info (PSI)
cat /proc/pressure/memory
# some avg10=12.34 avg60=8.90 avg300=6.20 total=1234567
PSI (Pressure Stall Information) is the best "memory or CPU is really tight" signal on modern Linux. Non-zero full avg10 means every process in the system was stuck waiting for memory reclaim for some fraction of the last 10 seconds — this is bad.
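PSI files are easy to script against. A small sketch (the psi_avg10 helper is ours) that pulls the avg10 figure out of a pressure file:

```shell
# Extract avg10 from a PSI file. "some" = at least one task stalled;
# "full" = all non-idle tasks stalled at once (the truly bad signal).
psi_avg10() {
  # $1 = "some" or "full"; stdin = contents of a /proc/pressure/* file
  awk -v kind="$1" '$1 == kind { sub(/^avg10=/, "", $2); print $2 }'
}

# psi_avg10 full < /proc/pressure/memory
```

Feeding this into a threshold check gives a cheap "the box is genuinely starved" alert.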
The #1 invisible performance issue in containers is cgroup throttling — especially CPU throttling. docker stats does not show it. Only cpu.stat / nr_throttled does. Every production team should alert on this metric. A container with nr_throttled > 0 is experiencing latency spikes the application layer cannot even see, and no amount of horizontal scaling helps.
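The simplest alert is the counter's first derivative. A sketch (function name ours) that reports how much nr_throttled grew over an interval:

```shell
# Report the growth of nr_throttled over an interval. On a
# latency-sensitive service, any non-zero delta deserves an alert.
nr_throttled_delta() {
  # $1 = path to cpu.stat, $2 = seconds between samples
  local a b
  a=$(awk '/^nr_throttled /{print $2}' "$1")
  sleep "$2"
  b=$(awk '/^nr_throttled /{print $2}' "$1")
  echo $(( b - a ))
}

# nr_throttled_delta "$CGROUP/cpu.stat" 60   # throttle events in the last minute
```

In real deployments the same delta is usually exported by cAdvisor / node metrics as container_cpu_cfs_throttled_periods_total; the shell version is for hosts without that plumbing.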
Step 3: Host-Level Diagnostics
If the container's cgroup is not throttling but the app is still slow, the host may be overloaded.
# CPU / run queue / swapping (1-second samples)
vmstat 1 10
# procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
# r b swpd free buff cache si so bi bo in cs us sy id wa st
# 14 2 0 1024 5000 20000 0 0 5 12 1500 3200 30 10 50 10 0
# ↑ 14 runnable processes, 2 blocked; r consistently > CPU count = CPU contention
# ↑ si/so > 0 = swapping (memory trouble)
# ↑ wa (iowait) = I/O trouble
# Per-CPU CPU usage
mpstat -P ALL 1 5
# Disk I/O per device
iostat -xz 1 5
# Device r/s w/s rkB/s wkB/s await aqu-sz %util
# nvme0n1 120 40 4096 1024 5.2 2.1 48.3
# ↑ await > 5ms on NVMe = congestion; %util near 100 = saturated
# Network
sar -n DEV 1 5
# Average: IFACE rxpck/s txpck/s rxkB/s txkB/s ...
The combination of these tells you which layer is the bottleneck:
- vmstat r >> CPU count with low %wa — CPU oversubscribed.
- vmstat si/so > 0 — memory tight, swapping.
- %wa high + iostat await high — disk bottleneck.
- %idle high and nothing else explains it — lock contention inside the app.
If you read the Linux Fundamentals course's Module 6 (Production Debugging), you have already seen these tools. The container world adds one layer: separate "is the container constrained by its cgroup" from "is the host constrained in aggregate." Check the cgroup first; if it is not the limit, host-level metrics tell you the rest of the story.
Step 4: Inside the Container — Profile the App
Once you have eliminated cgroup throttling and host-level overload, the application itself is the suspect. Tools for profiling inside a container:
# Get a shell
docker exec -it myapp sh
# CPU profiling for the main process inside
# If the image has perf and the host allows it:
perf top -p 1
# For Python apps, py-spy (does not need root if shared namespace)
pip install py-spy
py-spy top --pid 1
# For Node.js apps, --prof or clinic.js
node --prof server.js # V8 profiler; analyze the log with node --prof-process
# For JVM
jstack <pid> # thread dump
jfr start --name=profiling <pid>
If the image is minimal (distroless, scratch), you cannot install tools inside. Options:
- Run perf from the host against the container's PID. The container's main process has a real host PID; perf record -p <host-pid> -g captures everything.
- kubectl debug (K8s) — inject a sidecar with tools.
- Ephemeral container that shares the PID namespace: docker run --rm -it --pid=container:myapp nicolaka/netshoot.
top -H and htop
The simplest: look at threads (if you have -H support). One thread pegged while others idle = single-threaded bottleneck (common in Node, or in Python's GIL-holding code, or one hot loop in any language).
strace for syscall patterns
# Attach to the app's main process inside the container
# (needs CAP_SYS_PTRACE; may require --cap-add=SYS_PTRACE)
docker exec myapp strace -c -p 1 -f
# Runs for a few seconds, then prints a summary
# % time seconds usecs/call calls errors syscall
# 35.40 0.234567 1500 156 0 recvfrom
# 28.90 0.189012 1322 143 143 connect ← every connect is failing!
# 12.30 0.080501 89 904 0 epoll_wait
# ...
A skew in the syscall distribution — lots of time in connect, read, epoll_wait, futex — tells you what category of work the app is doing. futex dominant = lock contention. epoll_wait dominant = idle or waiting on I/O. connect with errors = DNS / upstream is slow.
Step 5: Check Downstream
The app may be slow because it is waiting on something else:
# From inside the container, time a DB query
docker exec myapp sh -c 'time psql -h db -U app -c "SELECT 1"'
# real 0m3.501s ← 3.5 seconds for a trivial query
# → the DB is slow / network to DB is slow / connection pool is exhausted
# Check network latency to neighbors (if nsenter is available on the host)
PID=$(docker inspect myapp --format='{{.State.Pid}}')
sudo nsenter -t $PID -n ping -c 3 db.internal
# DNS resolution timing
docker exec myapp sh -c 'time getent hosts db.internal'
# real 0m5.000s ← DNS is slow (5 seconds to resolve)
Slow downstream dependencies are invisible from the container's own metrics. Distributed tracing (OpenTelemetry, Jaeger, Tempo) is the right way to catch this at scale, but in a pinch, timed curls / DB pings reveal the bottleneck.
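Absent tracing, a millisecond timer around a probe command goes a long way. A sketch relying on GNU date's %N nanosecond format (the time_ms name is ours):

```shell
# Time any command in milliseconds. Wrap a DB ping or DNS lookup in it
# and run it in a loop to spot downstream latency spikes.
time_ms() {
  local start end
  start=$(date +%s%N)
  "$@" >/dev/null 2>&1
  end=$(date +%s%N)
  echo $(( (end - start) / 1000000 ))
}

# while true; do time_ms getent hosts db.internal; sleep 1; done
```

A few minutes of this in a terminal is often enough to catch an intermittently slow resolver or database.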
The "Noisy Neighbor" Case
Multiple containers on one host can compete for resources that cgroup limits do not partition cleanly:
- Page cache — two containers both need hot files; they evict each other.
- Memory bandwidth — unlike CPU time, memory bandwidth is not partitioned by cgroups.
- NIC and disk controller queues — limits are per-container, but the physical queues are shared.
- CPU cache lines — cache-unfriendly neighbor evicts your hot data.
Signs of a noisy neighbor
- Your container's metrics look fine, but response time oscillates.
- Other containers on the host have high CPU or I/O.
- Moving the workload to a less busy host improves things immediately.
Diagnosis
# All running containers and their load
docker stats --no-stream | sort -k3 -r # by CPU%
# PSI — is the HOST under pressure?
cat /proc/pressure/cpu
cat /proc/pressure/io
# Who is hogging what?
top -o %CPU # sort by CPU
pidstat -d 1 # per-process I/O
Mitigations:
- Spread containers: different hosts, or different NUMA nodes on the same host.
- Apply I/O limits (--device-read-bps / --device-write-bps) to prevent one container from saturating a shared device.
- Use --cpuset-cpus to pin latency-sensitive workloads to their own cores.
Concrete Patterns
CPU throttling on Kubernetes pod
# BAD for a latency-sensitive API
resources:
  requests:
    cpu: "500m"
  limits:
    cpu: "1"        # bursts past 1 CPU get throttled; p99 suffers

# Better — request without a hard limit
resources:
  requests:
    cpu: "500m"     # scheduling + cpu.weight under contention
  # no limit — bursts are free when the node has capacity
Known tradeoff: a runaway loop can now use more than its "request" under contention-free conditions. For most real workloads (web services), this is a feature, not a bug — you want the API to use spare capacity during spikes.
JVM heap + container limit
Container limit: 1 GiB
JVM heap (-Xmx): 1 GiB ← WRONG
The JVM reserves memory beyond -Xmx (direct buffers, metaspace, thread stacks, code cache, GC overhead). Setting -Xmx equal to the container limit leads to OOM-kills as the JVM's total memory exceeds the cgroup limit.
Container limit: 1 GiB
JVM heap (-Xmx): 768 MiB ← better, leaves 256 MiB for JVM overhead
Or use JVM container-aware flags (JDK 11+): -XX:MaxRAMPercentage=70.0 sets the heap to 70% of the container's memory limit automatically.
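The headroom rule above can be written down as arithmetic. A sketch where the 25%-or-256-MiB reserve is our assumption, chosen to match the example:

```shell
# Suggest an -Xmx for a given container memory limit (in MiB), reserving
# the larger of 25% or 256 MiB for metaspace, threads, and direct buffers.
suggest_xmx() {
  local limit=$1 headroom=$(( $1 / 4 ))
  [ "$headroom" -lt 256 ] && headroom=256
  echo "-Xmx$(( limit - headroom ))m"
}

suggest_xmx 1024   # → -Xmx768m, matching the example above
```

For small containers the fixed 256 MiB floor dominates, which is the right shape: JVM overhead does not shrink proportionally with heap size.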
Postgres connection pool exhaustion
Symptom: p99 latency spikes, app logs show "connection timeout" or long wait times. The container's CPU/memory is fine.
Root cause: the DB connection pool is exhausted; new requests queue.
Fix: scale up the pool, or scale up the DB itself (max_connections is finite). Often solved by connection pooling at a proxy layer (pgbouncer, RDS Proxy).
A service's p99 latency tripled after a deploy. No container metrics were off. No container was throttled. The team scaled up horizontally — no effect. Eventually someone looked at strace -c on the running container and saw that 70% of syscall time was in connect, all hitting the same internal API. That API was rate-limited at 100 req/s; the client was hitting 300. The client was queued behind its own 429 responses. Fix was at the client layer (retry with backoff + caching), not the container layer. Lesson: "container is slow" might mean nothing about the container; always check what it is waiting on.
Running Benchmarks Against Your Container
To separate "fast on my laptop" from "slow under production-like load":
# HTTP load tests
docker run --rm -i --network host grafana/k6 run - <<'EOF'
import http from 'k6/http';
export const options = { vus: 50, duration: '30s' };
export default function () { http.get('http://localhost:8080/health'); }
EOF
# Or hey, wrk, ab, vegeta — any HTTP benchmark
hey -z 30s -c 50 http://localhost:8080/health
Combine with docker stats and cpu.stat on the target container to see exactly what happens under load. This catches cgroup throttling issues before production does.
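One way to wire those together is to snapshot nr_throttled around the benchmark run. A sketch (wrapper name ours, cgroup v2 assumed):

```shell
# Run a benchmark command and report how much nr_throttled grew while it ran.
bench_with_throttle_check() {
  local stat="$1"; shift
  local before after
  before=$(awk '/^nr_throttled /{print $2}' "$stat")
  "$@"    # the benchmark command, passed through verbatim
  after=$(awk '/^nr_throttled /{print $2}' "$stat")
  echo "nr_throttled grew by $(( after - before )) during the run"
}

# bench_with_throttle_check "$CGROUP/cpu.stat" hey -z 30s -c 50 http://localhost:8080/health
```

A non-zero delta under load with a clean idle baseline is the clearest possible evidence that the CPU limit, not the application, is costing you latency.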
Key Concepts Summary
- Classify first. CPU-bound / memory-bound / I/O-bound / throttled / noisy neighbor / downstream — six categories, different tools.
- Check cgroup throttling first. cpu.stat nr_throttled and memory.events are invisible to docker stats.
- For latency-sensitive services, consider skipping CPU limits. A request without a limit under Kubernetes gives scheduling + weight without CFS-quota throttling.
- vmstat, iostat, mpstat reveal host-level bottlenecks (from the Linux course's debugging module).
- Profile inside the container with language-specific tools (py-spy, jstack, perf) once you've ruled out cgroup and host issues.
- PSI (/proc/pressure/*) is the modern "is anything waiting on CPU/mem/IO right now" signal.
- Set JVM / runtime heap limits below the container's memory limit to leave headroom.
- Distributed tracing catches downstream-wait issues the container itself cannot see.
- Test under load before production. Benchmarks reveal throttling that idle runs miss.
Common Mistakes
- Looking at docker stats CPU % and concluding "CPU is fine" without checking nr_throttled.
- Setting aggressive CPU limits on latency-sensitive services and wondering why p99 is bad.
- Conflating host overload with container constraint. Check both independently.
- Assuming a container at 70% memory is "healthy" without checking if it is actively in reclaim (memory.events high counter).
- Forgetting that bind-mounted filesystems on Docker Desktop are slow. Host I/O benchmarks look fine; container I/O is dragging.
- Running benchmarks at too-low concurrency and missing throttling that only shows up under load.
- Not instrumenting the app. Even a basic request counter + latency histogram in Prometheus beats any "container is slow" debugging after the fact.
- Blaming the container when the downstream is slow. Distributed tracing reveals this; guesses do not.
- Using top inside a minimal container that does not have it. Install it, or use host-side top.
- Confusing cpu.shares / cpu.weight (relative priority under contention) with cpu.max (hard cap). They are different knobs for different problems.
Your Kubernetes pod has `resources.limits.cpu: 2`. Dashboards show it's averaging 1.3 CPUs, never exceeding 2. But p99 latency is 3x higher than p50. `cpu.stat` shows `nr_throttled` increasing by several hundred per minute. What is happening and what is the right fix?