Your Service Worked at 1,000 RPS. At 3,000 It Started Failing With 'Connection Refused'. Welcome to TIME_WAIT and conntrack.

Two unrelated kernel limits bite high-throughput Kubernetes services: ephemeral port exhaustion from TIME_WAIT and conntrack table overflow. Same symptom, different root causes, different fixes.

By Sharon Sahadevan · 13 min read

A service is running fine. You scale it from 5 replicas to 30. Throughput jumps from 1,000 RPS to 4,000 RPS. p99 latency jumps too. Worse: about 5% of requests now fail with connect: connection refused or connect: cannot assign requested address.

You check the application logs. Nothing useful. You check the target service. It is healthy and not even sweating. The failures are happening before the request ever reaches it.

You check the node:

$ ss -s
Total: 32567
TCP:   28442 (estab 1245, closed 26801, orphaned 8, timewait 26012)

26,012 of the 28,442 TCP sockets in use are in TIME_WAIT. Now check:

$ cat /proc/sys/net/ipv4/ip_local_port_range
32768   60999

The ephemeral port range is 28,232 ports. You have 26,012 of them tied up in TIME_WAIT. You are out of source ports.

Or:

$ cat /proc/sys/net/netfilter/nf_conntrack_count
262143
$ cat /proc/sys/net/netfilter/nf_conntrack_max
262144

Conntrack table is full. New connections are dropped at the kernel before the application sees them.

These are two different problems with the same symptom. Both bite high-throughput services. Both stem from unsexy Linux defaults that were never tuned for your workload. This post covers the mechanics of each, the diagnostics that tell them apart, and the fixes that actually work.

Problem 1: TIME_WAIT and ephemeral port exhaustion

Every TCP connection has a four-tuple: (source IP, source port, dest IP, dest port). For an outbound connection from a pod calling another service:

  • Source IP: the pod IP (or the node IP after SNAT, depending on CNI).
  • Source port: a random port from the ephemeral range (default 32768-60999, ~28K ports).
  • Dest IP and dest port: fixed, known.

When the pod calls the same backend repeatedly, source IP, dest IP, and dest port are all constant. Only the source port varies. The number of distinct source ports caps how many simultaneous connections (in any state) you can have to that one backend.

Why TIME_WAIT eats source ports

When the active closer of a TCP connection has sent its FIN, received the peer's FIN (typically combined with an ACK), and sent the final ACK, the connection enters TIME_WAIT. It stays there for 2 * MSL (Maximum Segment Lifetime), which on Linux is 60 seconds, hardcoded.

The reason: a delayed packet from the just-closed connection might arrive for the same (source IP, source port, dest IP, dest port) tuple. If a new connection reused that tuple immediately, the delayed packet from the old connection could be misinterpreted as part of the new one.

Cost: for 60 seconds after a connection closes, that source port cannot be reused for a new connection to the same destination.

The math: when does this bite?

A pod making 500 outbound HTTP requests per second to one backend, with each connection short-lived (HTTP/1.1 without keep-alive, or HTTP/2 stream limits hit), generates 500 connections per second that go through TIME_WAIT.

500 RPS * 60 seconds in TIME_WAIT = 30,000 sockets in TIME_WAIT at any given moment.

The ephemeral port range is ~28,000 ports. The pod runs out. New connect() calls fail with EADDRNOTAVAIL: cannot assign requested address.

This is exactly the "5% failure rate" symptom in the scenario above. The first wave of requests fills the source ports; subsequent ones fail until TIME_WAITs start expiring; the system settles into a steady state where some fraction of connect attempts always fails.
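To make the arithmetic concrete, here is a back-of-envelope check as a minimal Go sketch, using the numbers from the scenario above (the constants are illustrative, not measured):

// Steady-state TIME_WAIT occupancy vs. the ephemeral port range.
package main

import "fmt"

func main() {
    const (
        rps            = 500               // short-lived connections per second to one backend
        timeWaitSecs   = 60                // Linux TIME_WAIT duration (fixed)
        ephemeralPorts = 60999 - 32768 + 1 // default ip_local_port_range: 28,232 ports
    )
    inTimeWait := rps * timeWaitSecs // sockets parked in TIME_WAIT at any instant
    fmt.Printf("TIME_WAIT at steady state: %d of %d usable ports\n", inTimeWait, ephemeralPorts)
    if inTimeWait >= ephemeralPorts {
        fmt.Println("out of source ports: connect() fails with EADDRNOTAVAIL")
    }
}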

Fix 1: connection pooling

The right fix is almost always at the application layer. Reusing connections instead of opening one per request eliminates the TIME_WAIT churn.

// Go: reuse connections via the http.Transport idle pool
import (
    "net/http"
    "time"
)

client := &http.Client{
    Transport: &http.Transport{
        MaxIdleConns:        100,
        MaxIdleConnsPerHost: 10, // critical for one-backend-heavy workloads
        IdleConnTimeout:     90 * time.Second,
    },
}

# Python: reuse connections via a requests.Session
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
adapter = HTTPAdapter(pool_connections=10, pool_maxsize=100)
session.mount('http://', adapter)
session.mount('https://', adapter)

For HTTP/1.1, this means keep-alive. For HTTP/2, this means reusing the same connection for multiple streams. For database connections, this means a pool. For gRPC, the channel itself is the pool.
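A minimal sketch of the gRPC case, assuming a hypothetical backend.internal:50051 target and insecure credentials for brevity (grpc.NewClient is the current grpc-go constructor; older versions use grpc.Dial):

package main

import (
    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
)

// One shared channel per backend; every RPC multiplexes as an HTTP/2 stream on it.
var backendConn *grpc.ClientConn

func init() {
    var err error
    // Create the channel once at startup. Creating it per request would
    // reintroduce the one-connection-per-request problem.
    backendConn, err = grpc.NewClient("backend.internal:50051",
        grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        panic(err)
    }
}

func main() {
    defer backendConn.Close()
    // Hand backendConn to every generated stub, e.g. pb.NewFooClient(backendConn).
}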

In load tests, run with the actual production HTTP client config. A test that loops curl -X POST opens a new connection per request yet looks fine, because a sequential loop never approaches the connection rate that exhausts ports; a real service doing the same thing concurrently hits the wall the moment it scales.

Fix 2: tcp_tw_reuse (safe in modern Linux)

When connection pooling is not enough or not feasible, net.ipv4.tcp_tw_reuse lets the kernel reuse a port still in TIME_WAIT for a new outbound connection, as long as the new connection's TCP timestamp is strictly greater than the last one seen on the old connection. That timestamp check is what makes this safe on modern kernels.

# Enable tcp_tw_reuse
sysctl -w net.ipv4.tcp_tw_reuse=1

# Persist
echo "net.ipv4.tcp_tw_reuse = 1" >> /etc/sysctl.d/99-tcp-tuning.conf

This is safe to enable widely. It only affects outbound connections (the typical TIME_WAIT consumer in a Kubernetes pod) and requires TCP timestamps (net.ipv4.tcp_timestamps=1, the default). Inbound TIME_WAIT (server side) is unchanged.

Fix 3: do NOT enable tcp_tw_recycle

net.ipv4.tcp_tw_recycle is the dangerous cousin. It aggressively recycles TIME_WAIT sockets based on per-source-IP timestamp tracking. In environments with NAT (which Kubernetes traffic frequently traverses between nodes), many clients share one source IP with unsynchronized clocks, so the kernel drops their legitimate connections as having "stale" timestamps.

tcp_tw_recycle was removed entirely in Linux kernel 4.12 because it was too dangerous. If you find advice online recommending it, that advice is from before 2017. Ignore it.

Fix 4: widen the ephemeral port range

The default 32768-60999 is conservative. You can widen it:

sysctl -w net.ipv4.ip_local_port_range="10000 65535"

This gives you ~55K ports instead of ~28K, roughly doubling your headroom.

Trade-off: services that bind specific ports below 32768 can now collide with ephemeral allocations. If anything on your nodes listens in the 10000-32768 range, its bind can fail because the kernel may already have handed that port to an outbound connection. Audit before changing, or exclude specific ports via net.ipv4.ip_local_reserved_ports.

Fix 5: shrink TIME_WAIT duration

You can lower tcp_fin_timeout, but that controls the duration of FIN_WAIT_2, not TIME_WAIT. TIME_WAIT itself is hardcoded at 60s in mainline Linux and not directly tunable. Various blog posts mention a tcp_tw_timeout sysctl; it does not exist in mainline Linux (only in some vendor kernels). Forget about lowering TIME_WAIT itself; use tcp_tw_reuse instead.

Problem 2: conntrack table overflow

Separate problem, related symptoms.

Linux's netfilter (the framework iptables sits on top of) maintains a connection tracking table: every connection through the host is tracked so the kernel can do stateful firewalling, NAT, and connection-aware routing. Each entry is a few hundred bytes; the table has a maximum size:

sysctl net.netfilter.nf_conntrack_max
# Default: depends on RAM, often 262144

Every TCP and UDP "flow" through the host counts. In Kubernetes, the kube-proxy iptables/IPVS DNAT rules go through conntrack. So does every cross-node pod-to-pod connection. So does every external connection. A busy node with 50K simultaneous connections through it is using 50K conntrack entries.

When the table fills, the kernel logs to dmesg:

nf_conntrack: table full, dropping packet

New connections to and through the host are silently dropped. The pod sees a connection timeout or a connection refused, depending on where the drop happens. The application sees a 5% failure rate and has no idea why, because nothing in user space surfaces the kernel's conntrack table.

Diagnosing conntrack issues

# Current count vs max
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Drops due to full table (needs conntrack-tools)
conntrack -S
# Look for: insert_failed, drop (or read /proc/net/stat/nf_conntrack directly)

# In dmesg
dmesg | grep -i conntrack | tail

The kernel's drop counter increments every time a packet is dropped because the table is full; insert_failed is its close cousin. Either one nonzero is the smoking gun for "I am being conntrack-throttled."

Fix: increase nf_conntrack_max and tune timeouts

The default is sized for general server use, not for a Kubernetes node serving thousands of concurrent flows.

# Raise to 1M
sysctl -w net.netfilter.nf_conntrack_max=1048576

# Also raise the hash table buckets (typically 1/4 of nf_conntrack_max)
sysctl -w net.netfilter.nf_conntrack_buckets=262144
# (read-only on older kernels; write /sys/module/nf_conntrack/parameters/hashsize instead)

Memory cost: each entry is roughly 300 bytes. 1M entries = ~300MB. On a node with 16GB+ RAM, this is trivial.

Tune timeouts to free entries faster:

# How long a fully-established TCP connection's conntrack entry lives after going idle
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=86400
# Default is 5 days (432000s); 1 day (86400s) is plenty for most clusters

# How long a TIME_WAIT entry lives in conntrack
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30
# Default is 120s; 30s is reasonable

# Generic timeout for unknown protocols
sysctl -w net.netfilter.nf_conntrack_generic_timeout=60
# Default is 600s

Persist all of these in /etc/sysctl.d/99-conntrack.conf.

conntrack and Kubernetes (kube-proxy mode interaction)

In kube-proxy iptables mode, every connection through a Service hits conntrack twice (once for the DNAT to the pod IP, once back). In IPVS mode, conntrack is also used (IPVS uses netfilter for connection state). In both, conntrack is on the critical path.

The eBPF kube-proxy replacement (Cilium) bypasses conntrack for Service connections, processing them in eBPF programs instead. This dramatically reduces conntrack pressure on Service-heavy nodes. Worth knowing if your conntrack tuning is not enough.

Telling the two problems apart

Symptom: connection failures at high throughput.

Quick discriminators:

# 1. Source port exhaustion (TIME_WAIT problem)
ss -s | grep timewait    # large number?
ss -ant | wc -l          # close to ephemeral range size?
# error message: "cannot assign requested address" / EADDRNOTAVAIL

# 2. conntrack table full
cat /proc/sys/net/netfilter/nf_conntrack_count   # close to nf_conntrack_max?
nstat -n | grep -i conntrack_drop                # nonzero?
dmesg | tail | grep -i conntrack                 # "table full" messages?
# error message: timeout / connection refused (depends on what the node forwards)

If the errors surface in the calling pod as EADDRNOTAVAIL, it is source port exhaustion; fix it at the calling side.

If errors hit anything going through the node, with packet drops in dmesg, it is conntrack; fix it at the node level.

Both can happen at once on a heavily loaded node.

Setting up monitoring

These two problems are silent until they bite. Catch them early with metrics:

# 1. Conntrack utilization
node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.7

# 2. Conntrack drops (any nonzero is bad)
rate(node_nf_conntrack_stat_insert_failed[5m]) > 0
# Note: needs node_exporter's conntrack collector + the nf_conntrack module loaded

# 3. TIME_WAIT counts on critical pods
# Requires custom exporter parsing /proc/$PID/net/sockstat or ss output

For per-pod TIME_WAIT visibility, you need a sidecar or a pod-level exporter that scrapes ss output. node_exporter only gives node-level views. The cadvisor-derived container_network_* metrics in Kubernetes give bytes/packets, not socket states.
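As a sketch of what such an exporter boils down to, here is a minimal Go checker. It assumes it runs where the relevant /proc is visible (inside the pod's netns for per-pod TIME_WAIT, on the host for conntrack): it counts TIME_WAIT sockets in /proc/net/tcp and /proc/net/tcp6 and reports conntrack utilization.

package main

import (
    "bufio"
    "fmt"
    "os"
    "strconv"
    "strings"
)

// countTimeWait counts sockets in TIME_WAIT (hex state 06, fourth column)
// in a /proc/net/tcp-format file.
func countTimeWait(path string) int {
    f, err := os.Open(path)
    if err != nil {
        return 0
    }
    defer f.Close()
    n := 0
    s := bufio.NewScanner(f)
    s.Scan() // skip the header line
    for s.Scan() {
        fields := strings.Fields(s.Text())
        if len(fields) > 3 && fields[3] == "06" {
            n++
        }
    }
    return n
}

// readInt reads a single integer from a /proc file, -1 on error.
func readInt(path string) int {
    b, err := os.ReadFile(path)
    if err != nil {
        return -1
    }
    v, _ := strconv.Atoi(strings.TrimSpace(string(b)))
    return v
}

func main() {
    tw := countTimeWait("/proc/net/tcp") + countTimeWait("/proc/net/tcp6")
    count := readInt("/proc/sys/net/netfilter/nf_conntrack_count")
    limit := readInt("/proc/sys/net/netfilter/nf_conntrack_max")
    fmt.Printf("TIME_WAIT sockets: %d\n", tw)
    if limit > 0 {
        fmt.Printf("conntrack: %d / %d (%.0f%% full)\n", count, limit, 100*float64(count)/float64(limit))
    }
}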

A simple workaround: alert on application-side EADDRNOTAVAIL errors. They are the canonical signal.
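In Go, that signal is easy to catch at the call site, because the errno survives the error-wrapping chain (the URL here is a placeholder):

package main

import (
    "errors"
    "log"
    "net/http"
    "syscall"
)

func main() {
    resp, err := http.Get("http://backend.internal:8080/healthz") // placeholder endpoint
    if err != nil {
        if errors.Is(err, syscall.EADDRNOTAVAIL) {
            // Increment a metric here; a sustained nonzero rate means this
            // pod is out of ephemeral source ports.
            log.Println("EADDRNOTAVAIL: source port exhaustion")
        }
        return
    }
    resp.Body.Close()
}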

When to suspect this in production

The pattern is distinctive:

  • Failures are a percentage, not all-or-nothing. (5% rather than 100%.)
  • They appear at a throughput threshold. (Below 2K RPS fine; above 3K RPS broken.)
  • Scaling out helps slightly (more pods = more source ports per backend), but not linearly.
  • Node CPU and memory are normal.
  • Application latency is fine for successful requests; failed requests fail fast (connection-level).

If the symptoms match, run the diagnostics. Both fixes (sysctl tuning + connection pooling) are quick wins.

Common mistakes

1. tcp_tw_recycle = 1. Old advice from pre-2017 articles. It has been removed from the kernel; if your old kernel still has it, do not enable it.

2. Tuning conntrack on the wrong host. In Kubernetes, the conntrack table is per-node, not per-pod. The DaemonSet that tunes sysctls must run on every worker node. Setting it once on your dev box does not help.

3. Ignoring SNAT in CNI. Some CNIs SNAT pod traffic to the node IP for outbound. This means all pods on a node share the node's ephemeral port range. A 50-pod node is sharing 28K ports across all pods. Your "per-pod throughput" calculation needs to account for this.

4. Forgetting which namespace conntrack lives in. Pod traffic is tracked where the host forwards it: the host's network namespace, not the pod's. Setting conntrack sysctls inside a container does not raise the host's limits; tuning needs to happen on the host, not inside containers.

5. Connection pool too small. "We have a pool" is necessary but not sufficient. If your MaxIdleConnsPerHost is 2 but you make 1,000 RPS to one host, you are still creating connections and burning ports. Size the pool to match throughput; see the sizing sketch after this list.

6. Not testing at production-realistic concurrency. Functional tests at 100 RPS look fine; production at 5,000 RPS hits the wall. Load tests should match production concurrency or you discover this in prod.
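For mistake 5, the back-of-envelope sizing rule is Little's law: connections in flight ≈ request rate × average latency. A minimal Go sketch with illustrative numbers (not measurements):

package main

import "fmt"

func main() {
    const (
        rps        = 1000.0 // requests per second to one host
        latencySec = 0.05   // average request latency: 50 ms
    )
    // Little's law: concurrency = arrival rate * time in system.
    concurrent := rps * latencySec
    fmt.Printf("~%.0f connections in flight; size MaxIdleConnsPerHost at or above this\n", concurrent)
}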

Quick reference: the high-throughput TCP checklist

1. Check the symptom:
   ss -s                   # high timewait?
   sysctl net.netfilter.nf_conntrack_count   # near limit?
   conntrack -S | grep insert_failed         # nonzero?

2. Source port exhaustion fix:
   - Application: connection pool with right MaxIdleConnsPerHost
   - Kernel: sysctl -w net.ipv4.tcp_tw_reuse=1
   - Optional: widen ip_local_port_range to 10000-65535

3. conntrack exhaustion fix:
   - Increase nf_conntrack_max (1M is reasonable)
   - Increase nf_conntrack_buckets to ~25% of max
   - Lower nf_conntrack_tcp_timeout_established (to 86400s)
   - Lower nf_conntrack_tcp_timeout_time_wait (to 30s)

4. Persist via a DaemonSet that applies the sysctls on every node (or via /etc/sysctl.d/ in the node image)

5. Monitor:
   - conntrack utilization (>0.7 warn)
   - conntrack drops rate (any nonzero alert)
   - application EADDRNOTAVAIL error rate

6. For scale: consider eBPF kube-proxy replacement (Cilium) which
   bypasses conntrack for Service traffic entirely.

A DaemonSet that applies these settings

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: sysctl-tuner
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: sysctl-tuner
  template:
    metadata:
      labels:
        app: sysctl-tuner
    spec:
      hostPID: true
      hostNetwork: true
      tolerations:
        - operator: Exists
      initContainers:
        - name: sysctl-tuner
          image: busybox:1.36
          securityContext:
            privileged: true
          command:
            - sh
            - -c
            - |
              set -e
              sysctl -w net.ipv4.tcp_tw_reuse=1
              sysctl -w net.ipv4.ip_local_port_range="10000 65535"
              sysctl -w net.netfilter.nf_conntrack_max=1048576
              sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=86400
              sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30
              echo "sysctl tuning applied"
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.10

The init container applies the settings; the pause container keeps the pod alive (so the DaemonSet stays scheduled and reapplies on node reboot when sysctls reset).

For a production-grade setup, manage these via kubelet config, the node bootstrap script, or a real config-management tool. The DaemonSet is the lowest-friction option for a quick fix.

The mental model

Two distinct kernel limits, both invisible to the application, both bite at high throughput:

  • TIME_WAIT + ephemeral ports is a per-source-IP, per-destination problem. The fix is at the calling side: connection pooling, tcp_tw_reuse, wider ephemeral range.

  • conntrack is a per-node problem. The fix is at the node level: increase the table size, tune timeouts.

Both have been around for decades but show up in Kubernetes because:

  • High pod density per node multiplies the conntrack pressure.
  • Microservices make many short-lived connections (HTTP/1.1 clients without keep-alive, gRPC/HTTP2 stream limits, and the like).
  • Service mesh sidecars add another conntrack hop per request.

The defaults are not for your workload. Tune them. Or run into them at the worst possible time.


The full TCP/IP stack (handshake, states, conntrack, NAT, congestion control) is the spine of the Networking Fundamentals course. The Linux kernel side (netfilter, sysctls, /proc/net surfaces) lives in the Linux Fundamentals course.
