Your Cluster Has 5,000 Services and kube-proxy Is the Bottleneck. Welcome to the iptables Cliff.
Every Service creation rewrites your entire iptables chain. At small scale you never notice. At 5,000 Services kube-proxy is at 100% CPU, Service updates take 30 seconds, and your p99 latency is measured in seconds. Here is the cliff and how to fall off it.
Your platform team gets a ticket: a deploy that used to take 30 seconds now takes 4 minutes. The bottleneck is not the build, the image pull, scheduling, or the readiness probe. It is the time between when the new Service object is created and when it actually starts routing traffic.
You poke around. kube-proxy on every node is at 80-100% CPU. The CPU spike is bursty: a flatline, then a 30-second peak, then back to baseline. The peaks line up with Service create/update events. Your iptables -L | wc -l returns 250,000 lines. Each Service add or remove rewrites the entire chain.
Welcome to the iptables performance cliff. Your cluster crossed it somewhere between 3,000 and 8,000 Services and you didn't know there was a cliff to fall off. This post covers what kube-proxy actually does, why iptables mode falls over at scale, what IPVS gives you instead, the eBPF alternative that is replacing both, and the migration plan that doesn't break production.
What kube-proxy actually does#
Kubernetes Services have a stable cluster IP that load-balances to the pods backing the Service. The load balancer is not a separate component. It is implemented by kube-proxy, a DaemonSet that runs on every node and programs the node's network stack to do the rewriting.
When a pod connects to payments.svc.cluster.local:8080:
- CoreDNS resolves the name to the Service's ClusterIP (e.g., 10.96.45.123).
- The pod sends a TCP SYN to 10.96.45.123:8080.
- The kernel routing logic on the node intercepts the packet (the ClusterIP is not a real IP that exists anywhere, just a routing target).
- kube-proxy's rules pick a backend pod IP and rewrite the destination to that pod IP and port.
- The packet exits the node toward the chosen pod (possibly on the same node, possibly on another).
The "kube-proxy's rules pick a backend pod IP" step is where the modes differ. iptables, IPVS, and eBPF do this different ways, with very different scaling characteristics.
iptables mode: how it works#
iptables is the default mode, and has been for years. kube-proxy watches the API server, learns about every Service and Endpoint, and writes a giant chain of iptables rules.
For each Service, you get something like this in the nat table:
Chain KUBE-SERVICES (1 references)
target prot opt source destination
KUBE-SVC-MYSVC-A1B2C3 tcp -- any 10.96.45.123 tcp dpt:8080
KUBE-SVC-MYSVC-D4E5F6 tcp -- any 10.96.45.124 tcp dpt:8080
... thousands more ...
Chain KUBE-SVC-MYSVC-A1B2C3 (1 references)
target prot opt source destination
KUBE-SEP-POD-1 all -- any any /* 0.333 probability */
KUBE-SEP-POD-2 all -- any any /* 0.5 probability */
KUBE-SEP-POD-3 all -- any any /* default */
Chain KUBE-SEP-POD-1 (1 references)
target prot opt source destination
DNAT tcp -- any any to:10.244.1.5:8080
For each Service, kube-proxy emits:
- One rule in KUBE-SERVICES to match the ClusterIP and dispatch to a per-Service chain.
- One per-Service chain (KUBE-SVC-MYSVC-...) that randomly picks among endpoints using the statistic module.
- One per-endpoint chain (KUBE-SEP-POD-...) that DNATs to the actual pod IP.
For 5,000 Services with 5 endpoints each, you have:
- 5,000 entries in KUBE-SERVICES
- 5,000 per-Service chains
- 25,000 per-endpoint chains
- All of this duplicated for cluster-internal traffic, NodePort, LoadBalancer, etc.
Total rules in the table: hundreds of thousands.
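To see where your own cluster stands, a rough count of kube-proxy's NAT rules on any node looks like this (a sketch; chain names can vary slightly by version):
# Total kube-proxy NAT rules on this node
sudo iptables-save -t nat | grep -c '^-A KUBE'
# Dispatch entries, per-Service chains, per-endpoint chains
sudo iptables-save -t nat | grep -c '^-A KUBE-SERVICES '
sudo iptables-save -t nat | grep -c '^:KUBE-SVC'
sudo iptables-save -t nat | grep -c '^:KUBE-SEP'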
The cliff: why it falls over#
iptables was designed for 1990s-era firewalls, where rule counts were in the dozens. Two properties from that era still apply:
1. Updates rewrite everything. When kube-proxy gets a Service-changed event, it does not patch one rule. It writes the entire ruleset again with iptables-restore. With 250K rules, this single restore call takes 5-15 seconds of CPU time per node. While it runs, kube-proxy is at 100% CPU.
2. Lookups are linear. When a packet hits KUBE-SERVICES, the kernel walks the chain top to bottom comparing destination IP and port. With 5,000 rules in the chain, every packet does up to 5,000 comparisons. At 100,000 packets per second (a moderately busy node), that is 500 million comparisons per second per node. CPU is burned on routing instead of doing work.
Result: kube-proxy CPU is high constantly (because of updates), packet latency is high constantly (because of lookups), and Service-creation propagation time grows linearly with cluster size.
The cliff is gradual, not sudden. At 500 Services nobody notices. At 2,000 you might see kube-proxy CPU bumping up. At 5,000 deploys feel sluggish. At 10,000 the cluster is meaningfully slow on every operation.
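If you scrape kube-proxy's metrics endpoint, you can watch the rewrite cost directly instead of inferring it from CPU. A sketch, assuming the kubeproxy_sync_proxy_rules_duration_seconds histogram is being collected by Prometheus:
# p99 time kube-proxy spends rewriting rules per sync, per node
histogram_quantile(0.99,
  sum by (instance, le) (
    rate(kubeproxy_sync_proxy_rules_duration_seconds_bucket[5m])
  )
)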
IPVS mode: how it actually scales#
IPVS (IP Virtual Server) is a Linux kernel feature designed for load balancers, not firewalls. It stores Services in a hash table inside the kernel and updates incrementally.
When kube-proxy uses IPVS:
# Each Service is an IPVS virtual service
ipvsadm -L -n
TCP 10.96.45.123:8080 rr
-> 10.244.1.5:8080 Masq 1 0 0
-> 10.244.2.7:8080 Masq 1 0 0
-> 10.244.3.9:8080 Masq 1 0 0
Two key differences from iptables:
1. Lookup is O(1). IPVS stores Services in a hash table indexed by (destination IP, destination port, protocol). A packet arrives, the kernel hashes the tuple, finds the matching virtual service, picks an endpoint per the configured scheduler. No chain walking.
2. Updates are incremental. Adding a Service is an ipvsadm -A syscall that adds one entry. Removing is ipvsadm -D. Endpoints can be added or removed individually. No global rewrite, no 100% CPU spikes.
The result: kube-proxy CPU is flat regardless of Service count. Packet latency is constant. Service creation propagation is sub-second even at 10,000+ Services.
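To make "incremental" concrete, here is roughly what the equivalent ipvsadm operations look like. Illustration only: the virtual IP and backend below are made up, and you should not hand-edit entries that kube-proxy manages on a live node.
# Add one virtual service: a single netlink update, not a full-table rewrite
sudo ipvsadm -A -t 10.96.99.99:8080 -s rr
# Backends are added and removed one at a time
sudo ipvsadm -a -t 10.96.99.99:8080 -r 10.244.1.20:8080 -m
sudo ipvsadm -d -t 10.96.99.99:8080 -r 10.244.1.20:8080
# Removing the virtual service is likewise a single operation
sudo ipvsadm -D -t 10.96.99.99:8080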
Trade-offs to know:
- IPVS still uses iptables for some auxiliary rules (NodePort, source NAT, etc.). The main ClusterIP path is hash-table; the rest is small-rule iptables. Net win.
- IPVS supports more sophisticated load balancing algorithms (round robin, least connections, source hashing, etc.). kube-proxy IPVS mode defaults to round robin (rr).
- IPVS uses a different kernel module (ip_vs). It must be loaded on every node before kube-proxy can use it.
How to switch from iptables to IPVS#
The right way is per node group, with safety checks. The wrong way is editing the kube-proxy ConfigMap and restarting everything. Here is the right way.
Step 1: load the kernel modules on every node.
IPVS needs these modules: ip_vs, ip_vs_rr, ip_vs_wrr, ip_vs_sh, nf_conntrack. Check:
# On a node
lsmod | grep -E "^ip_vs|^nf_conntrack"
If missing, load and persist:
modprobe ip_vs
modprobe ip_vs_rr
modprobe ip_vs_wrr
modprobe ip_vs_sh
modprobe nf_conntrack
cat <<EOF | sudo tee /etc/modules-load.d/ipvs.conf
ip_vs
ip_vs_rr
ip_vs_wrr
ip_vs_sh
nf_conntrack
EOF
For managed node groups, this typically goes into a node bootstrap script or a custom AMI.
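If SSH access to nodes is awkward, one way to spot-check a node is an ephemeral debug pod. Kernel modules are not namespaced, so /proc/modules inside the pod reflects the host. This assumes Kubernetes 1.20+ for kubectl debug node; substitute your node name:
# Spot-check modules on a node without SSH
kubectl debug node/<node-name> -it --image=busybox -- \
  grep -E 'ip_vs|nf_conntrack' /proc/modules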
Step 2: install ipvsadm for diagnostics.
# On Ubuntu/Debian
apt-get install -y ipvsadm
# On RHEL/CentOS
yum install -y ipvsadm
You will use this to verify IPVS is actually programming what you expect.
Step 3: change kube-proxy mode.
# kube-proxy ConfigMap (kubectl edit cm kube-proxy -n kube-system)
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: ipvs
ipvs:
scheduler: rr # round robin (default), rr/wrr/lc/wlc/lblc/lblcr/sh/dh/sed/nq
syncPeriod: 30s
minSyncPeriod: 5s
Step 4: restart kube-proxy.
kubectl rollout restart daemonset/kube-proxy -n kube-system
kubectl rollout status daemonset/kube-proxy -n kube-system
The DaemonSet rolls one pod at a time. Each node briefly loses kube-proxy (existing connections continue, new connections may fail for a few seconds) before the new pod takes over in IPVS mode.
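It is worth confirming the restarted pods actually came up in IPVS mode before going further. The label selector below assumes the standard k8s-app=kube-proxy label, and the exact log wording varies by version (typically along the lines of "Using ipvs Proxier"):
# Look for the proxier mode in the fresh kube-proxy logs
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=100 | grep -i proxier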
Step 5: verify.
# On a node, see IPVS rules
sudo ipvsadm -L -n | head -20
# Check the count
sudo ipvsadm -L -n | grep -c "^TCP\|^UDP"
# Confirm iptables KUBE-SVC chains are gone
sudo iptables -t nat -L KUBE-SERVICES | wc -l
# Should be much smaller than before
Routing latency should drop noticeably. kube-proxy CPU should be near-zero except during sync intervals.
A subtle gotcha: the dummy interface#
In IPVS mode, kube-proxy creates a dummy network interface (kube-ipvs0) and adds every Service ClusterIP as an alias on it. This is so the kernel's routing logic recognizes the ClusterIP as locally-handled (the prerequisite for IPVS to even see the packet).
ip addr show dev kube-ipvs0
# A long list of ClusterIPs
The downside: at 10,000 Services, that is 10,000 IP aliases. ip addr show on a node becomes slow. Some monitoring tools that read /proc/net/dev or query interfaces individually slow down. This is rarely a real problem, but a thing to know.
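If you want to see how long that alias list has grown on a node, a quick count (assuming the default kube-ipvs0 interface name):
# Number of ClusterIP aliases kube-proxy has added to the dummy interface
ip -4 addr show dev kube-ipvs0 | grep -c 'inet '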
The eBPF alternative: kube-proxy replacement#
Cilium and a few other CNIs offer a "kube-proxy replacement" that does the same job using eBPF programs attached at the socket layer or the TC layer. No iptables, no IPVS, no kube-proxy DaemonSet needed.
# Cilium Helm values to enable it
kubeProxyReplacement: strict
k8sServiceHost: <api-server-host>
k8sServicePort: 6443
When enabled, you uninstall kube-proxy entirely. Cilium's eBPF programs intercept connections in the kernel directly and route to backends.
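A sketch of the corresponding install and verification, assuming the Cilium Helm chart; note that the accepted kubeProxyReplacement values have changed across Cilium releases, so check the docs for your version:
# Install Cilium with kube-proxy replacement enabled
helm install cilium cilium/cilium --namespace kube-system \
  --set kubeProxyReplacement=strict \
  --set k8sServiceHost=<api-server-host> \
  --set k8sServicePort=6443
# Confirm the agent reports the replacement as active
kubectl -n kube-system exec ds/cilium -- cilium status | grep -i kubeproxyreplacement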
Advantages over IPVS:
- Even faster lookup (eBPF maps are highly optimized).
- Per-pod policy enforcement at the same layer (NetworkPolicy + LB in one).
- Better observability via Hubble.
Disadvantages:
- Cilium-specific. Locks you to a CNI.
- More complex to debug when things go wrong (eBPF is harder to inspect than iptables or ipvsadm).
For greenfield clusters or clusters already running Cilium, kube-proxy replacement is increasingly the default recommendation. For existing clusters on a different CNI, switching to IPVS first is a smaller jump.
When iptables is fine#
Not every cluster needs to migrate. iptables mode is fine when:
- Service count is under 1,000 and not growing fast. Most enterprise clusters live here forever.
- Your cluster is single-tenant with stable workloads. Service churn is low; the iptables rewrite cost rarely matters.
- You are running an older Kubernetes version that has known issues with IPVS (1.16-1.18 had several). Modern versions (1.25+) are stable.
If you are happy and not seeing kube-proxy CPU or Service-update latency issues, do not migrate just because. iptables mode is well understood, debuggable with standard tools, and more battle-tested in production than the alternatives.
How to know if you are on the cliff#
Three measurements answer "should I migrate":
1. kube-proxy CPU usage.
# Per-node, kube-proxy CPU as fraction of one core
rate(container_cpu_usage_seconds_total{
namespace="kube-system", pod=~"kube-proxy-.*"
}[5m])
If sustained above 50% on any node, you have an iptables problem. If bursty to 100% on every Service change, you have an iptables problem.
2. Service propagation latency.
# After creating my-service, time how long until it actually routes from a fresh pod
time kubectl run test-pod --image=busybox --rm -it --restart=Never -- \
sh -c 'while ! wget -q -O- http://my-service:8080 > /dev/null 2>&1; do sleep 0.5; done'
If propagation is more than a few seconds, you are on the cliff.
3. Service count.
kubectl get svc --all-namespaces | wc -l
Above 2,000 and growing: plan the migration. Above 5,000: stop reading this and go migrate.
Common mistakes during migration#
1. Forgot to load kernel modules on some nodes. kube-proxy starts in IPVS mode but cannot configure anything. Pods on that node lose Service routing. Verify with lsmod | grep ip_vs on every node before flipping.
2. Did not test on staging first. IPVS mode has a few behavioral differences (source IP visibility, conntrack handling, timeouts). Test your specific workloads on a staging cluster before flipping production.
3. Forgot to update monitoring queries. Dashboards looking for iptables_* metrics or KUBE-SVC chain counts go blank after migration. Update queries to use ipvsadm-derived metrics or netfilter_conntrack counts.
4. NodePort source IP changes. In IPVS mode, NodePort traffic gets SNAT-ed differently than in iptables mode. Some applications (rate limiters, audit loggers, geo-IP) that read the source IP from incoming connections see the node IP instead of the original client IP. Set externalTrafficPolicy: Local on the Service to preserve client IP, with the trade-off that traffic only routes to pods on the same node (a minimal manifest sketch follows this list).
5. Did not set --cluster-cidr correctly. In IPVS mode, kube-proxy needs --cluster-cidr (or the equivalent ConfigMap field) to know which addresses are pod IPs vs external. Wrong value: SNAT happens for traffic that should not be SNAT-ed, or vice versa.
6. Performance "did not improve". If your bottleneck was not kube-proxy in the first place, IPVS does not help. CPU on the application pods, network latency between zones, DNS resolution time, etc., are unaffected by kube-proxy mode.
Quick reference: the migration checklist#
1. Measure your starting point:
- kube-proxy CPU (Prometheus)
- Service propagation latency (manual test)
- Service count
2. Validate readiness:
- Kernel modules on every node (ip_vs*, nf_conntrack)
- ipvsadm tool installed for diagnostics
- K8s version 1.25+ for stable IPVS
3. Test on staging:
- Migrate kube-proxy mode to ipvs
- Run full integration test suite
- Verify NodePort source IP behavior
- Compare baseline performance
4. Roll out to production:
- Edit kube-proxy ConfigMap (mode: ipvs)
- kubectl rollout restart daemonset/kube-proxy -n kube-system
- Watch kube-proxy CPU drop
- Re-measure Service propagation latency
5. Verify:
- ipvsadm -L -n shows your services
- iptables KUBE-SVC chains are mostly gone
- Application connectivity unchanged
- Monitoring dashboards updated for new mode
6. Plan ahead:
- eBPF kube-proxy replacement (Cilium) for the next jump
- Or stay on IPVS forever if you don't need eBPF features
The mental model#
kube-proxy is the load balancer for Kubernetes Services. It has three implementations:
- iptables: linear chain, global rewrites. Simple. Falls over at thousands of Services.
- IPVS: hash table, incremental updates. Same model as a real load balancer. Scales to 50K+ Services.
- eBPF (kube-proxy replacement): in-kernel BPF programs. Even faster, but CNI-specific.
The iptables cliff is famous because it is invisible until it is not. Service count is the canary. Above 1,000 you should be aware. Above 5,000 you have probably already noticed and just not connected the dots.
The migration to IPVS is a few hours of work plus a few weeks of testing. The migration to eBPF is a CNI swap, which is bigger. Both are well-trodden paths. The cluster you have today probably wants IPVS; the cluster you build next year probably wants eBPF.
Either way: stop accepting that "Service propagation is slow" is just how Kubernetes works. It is not. It is how iptables-mode kube-proxy at scale works.
The full networking stack for Kubernetes (CNI, kube-proxy modes, conntrack, NetworkPolicy, service mesh) is the spine of the Networking Fundamentals course. The production-scale operational patterns (when to migrate, how to roll out, how to monitor) are part of the Production Kubernetes Operations course.