L4 vs L7 Load Balancing
You have an AWS NLB routing traffic to 10 pods behind a Kubernetes Service. Monitoring shows all 10 pods as healthy — every one passes its TCP health check every 10 seconds. But your users are reporting intermittent 500 errors.
You check the pod logs. Sure enough, one pod is returning HTTP 500 for every single request. The application crashed internally, but the process is still running and accepting TCP connections.
The load balancer keeps sending traffic to it. Why? Because your health check operates at Layer 4. It can establish a TCP connection. It has no idea what HTTP status code the application returns. This is the fundamental difference between L4 and L7 load balancing — and understanding it will save you hours of debugging.
Part 1: Layer 4 Load Balancing — Fast, Simple, Blind
Layer 4 load balancing operates at the transport layer of the OSI model. It sees exactly four things about every connection:
- Source IP address
- Destination IP address
- Source port
- Destination port
That is it. An L4 load balancer never opens the TCP payload. It never reads HTTP headers, URL paths, cookies, or request bodies. It makes routing decisions based purely on IP addresses and port numbers, then forwards raw TCP (or UDP) packets to a backend.
# What an L4 load balancer "sees" for each connection:
# Source: 192.168.1.50:49832
# Destination: 10.0.0.100:443
# Protocol: TCP
#
# That is ALL the information it has to make a routing decision.
# It cannot see: GET /api/users HTTP/1.1
# It cannot see: Host: api.example.com
# It cannot see: Cookie: session=abc123
An L4 load balancer is a packet forwarder. It receives a TCP SYN, picks a backend using its algorithm, rewrites the destination IP/port, and forwards the packet. It never terminates the TCP connection — the client talks directly to the backend through the load balancer. This is why L4 is fast: there is no protocol parsing overhead.
L4 Load Balancing Algorithms
Since L4 load balancers cannot see application-level data, their routing algorithms are simple:
Round Robin — send each new connection to the next backend in sequence. Backend 1, then 2, then 3, then back to 1. Simple and fair when all backends are equal.
Least Connections — send each new connection to the backend with the fewest active connections. Better than round robin when request durations vary (one slow request does not cause pile-up on a single backend).
Source IP Hash — hash the client IP and always send that client to the same backend. Provides primitive session affinity without cookies. Breaks when clients share an IP (corporate NAT, mobile carriers).
Weighted Round Robin — same as round robin but some backends get more connections. Useful when backends have different capacity (e.g., 8-core and 16-core nodes).
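The first three algorithms can be modeled in a few lines. A minimal Python sketch (a toy model using the pod IPs from the example below — not any real load balancer's code):

```python
import hashlib
from itertools import cycle

backends = ["10.244.1.5", "10.244.2.8", "10.244.3.2"]

# Round robin: each new connection goes to the next backend in sequence.
rr = cycle(backends)
def round_robin():
    return next(rr)

# Least connections: pick the backend with the fewest active connections.
# A real balancer also decrements the count when a connection closes.
active = {b: 0 for b in backends}
def least_connections():
    choice = min(active, key=active.get)
    active[choice] += 1
    return choice

# Source IP hash: the same client IP always maps to the same backend,
# giving primitive session affinity without cookies.
def ip_hash(client_ip):
    digest = hashlib.md5(client_ip.encode()).digest()
    return backends[int.from_bytes(digest[:4], "big") % len(backends)]
```

Note that least connections behaves exactly like round robin when every request finishes instantly; the difference only shows up when one backend holds connections longer than the others.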
# Example: iptables-based L4 load balancing (what kube-proxy does)
# Three pods behind a ClusterIP Service:
# Pod A: 10.244.1.5:8080
# Pod B: 10.244.2.8:8080
# Pod C: 10.244.3.2:8080
# kube-proxy creates these iptables rules (simplified):
iptables -t nat -A KUBE-SERVICES \
-d 10.96.0.100/32 -p tcp --dport 80 \
-j KUBE-SVC-XXXX
# Each pod gets a probability-based chain:
iptables -t nat -A KUBE-SVC-XXXX \
-m statistic --mode random --probability 0.333 \
-j KUBE-SEP-POD-A # DNAT to 10.244.1.5:8080
iptables -t nat -A KUBE-SVC-XXXX \
-m statistic --mode random --probability 0.500 \
-j KUBE-SEP-POD-B # DNAT to 10.244.2.8:8080
iptables -t nat -A KUBE-SVC-XXXX \
-j KUBE-SEP-POD-C # DNAT to 10.244.3.2:8080
The iptables probability math looks wrong but it is correct. The first rule fires 33.3% of the time. Of the remaining 66.7%, the second rule fires 50% of that (which is 33.3% of total). The third rule catches everything else (33.3%). This gives equal distribution across three pods. Add a fourth pod and the probabilities become 0.25, 0.333, 0.5, and fallthrough.
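You can check this arithmetic with a quick simulation — a sketch of the sequential rule evaluation in plain Python, not kube-proxy itself:

```python
import random

def pick_pod(rng):
    # Mirror the sequential iptables rules: fire at 0.333, then at 0.5
    # of the remainder, then fall through to the last pod.
    if rng.random() < 1 / 3:
        return "pod-a"
    if rng.random() < 1 / 2:
        return "pod-b"
    return "pod-c"

rng = random.Random(0)
trials = 100_000
counts = {"pod-a": 0, "pod-b": 0, "pod-c": 0}
for _ in range(trials):
    counts[pick_pod(rng)] += 1

for pod, n in counts.items():
    print(pod, round(n / trials, 3))  # each pod lands near 0.333
```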
Real-World L4 Load Balancers
| Load Balancer | Where You See It | Key Characteristics |
|---|---|---|
| AWS NLB | Cloud infrastructure | Millions of requests/sec, static IP, TLS passthrough |
| kube-proxy (iptables) | Every K8s cluster | Default Service implementation, random selection |
| kube-proxy (IPVS) | Large K8s clusters | Real algorithms (rr, lc, dh, sh), better at scale |
| MetalLB | Bare-metal K8s | Announces Service IPs via ARP/BGP |
| HAProxy (TCP mode) | Traditional infrastructure | Feature-rich, health checks, connection draining |
Part 2: Layer 7 Load Balancing — Slower, Smarter, Content-Aware
Layer 7 load balancing operates at the application layer. It terminates the client TCP connection, parses the HTTP request (or gRPC, WebSocket, etc.), and makes routing decisions based on the full request content.
An L7 load balancer sees everything:
# What an L7 load balancer "sees":
# Everything L4 sees, PLUS:
#
# HTTP Method: GET
# URL Path: /api/v2/users/12345
# Host Header: api.example.com
# Headers: Authorization: Bearer eyJhbG...
# Content-Type: application/json
# X-Request-ID: abc-123
# Cookies: session=xyz789; region=us-east
# Query Params: ?include=profile&limit=50
# Request Body: {"name": "updated-name"}
This visibility enables powerful routing strategies that L4 simply cannot do.
In short: Layer 4 (Transport) is fast, simple, and protocol-agnostic; Layer 7 (Application) is smart, content-aware, and protocol-specific.
L7 Routing Strategies
Path-based routing — route by URL path. /api/* goes to the API service, /static/* goes to the CDN origin, /admin/* goes to the admin service.
# Kubernetes Ingress with path-based routing
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
spec:
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 80
      - path: /
        pathType: Prefix
        backend:
          service:
            name: frontend-service
            port:
              number: 80
Header-based routing — route by HTTP header. Useful for canary deployments: if X-Canary: true, route to the canary backend.
Weighted routing — split traffic by percentage. Send 95% to v1 and 5% to v2. Gradually shift weight during rollouts.
Cookie-based sticky sessions — always route a user to the same backend based on a session cookie. Necessary for stateful applications that store session data in memory (though you should fix the app instead).
Sticky sessions are a code smell. They mean your application stores state in local memory instead of an external store (Redis, database). When the sticky backend crashes, the user loses their session. Design stateless services and store session data externally. Use sticky sessions only as a temporary workaround while you fix the architecture.
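The weighted split described above can be sketched in a few lines of Python (the `v1`/`v2` names are illustrative, not tied to any real load balancer config):

```python
import random

# Weighted routing: ~95% of requests to v1, ~5% to the v2 canary.
# The split applies per request, not per user.
weights = {"v1": 95, "v2": 5}
versions = list(weights)

def pick_version(rng):
    return rng.choices(versions, weights=list(weights.values()))[0]

rng = random.Random(7)
trials = 10_000
canary_hits = sum(pick_version(rng) == "v2" for _ in range(trials))
print(f"v2 share: {canary_hits / trials:.1%}")  # close to 5%
```

Shifting the weights (95/5 → 80/20 → 50/50 → 0/100) is exactly how a gradual rollout progresses.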
Real-World L7 Load Balancers
| Load Balancer | Where You See It | Key Characteristics |
|---|---|---|
| AWS ALB | Cloud infrastructure | Path/header routing, WebSocket support, WAF integration |
| NGINX Ingress | K8s clusters | Most popular Ingress controller, annotation-driven config |
| Envoy | Service meshes (Istio) | xDS dynamic config, gRPC-native, advanced observability |
| Traefik | K8s clusters | Auto-discovery, built-in Let's Encrypt, simpler config |
| HAProxy (HTTP mode) | Traditional infrastructure | Very fast, rich ACL system, battle-tested |
Part 3: Health Checks — Where L4 and L7 Diverge Most
This is where the lesson from the opening scenario becomes concrete. Health checks determine whether a backend receives traffic. The layer at which you check determines what failures you can detect.
L4 Health Checks (TCP)
An L4 health check sends a TCP SYN packet to the backend port. If it gets a SYN-ACK back (meaning the port is open and accepting connections), the backend is "healthy."
# What an L4 health check does (equivalent):
nc -zv 10.244.1.5 8080
# Connection to 10.244.1.5 8080 port [tcp/*] succeeded!
# Result: HEALTHY
# But the app behind that port might be:
# - Returning HTTP 500 for every request
# - Deadlocked and not processing any requests
# - Connected to a dead database and timing out
# - Serving stale data from a broken cache
#
# The L4 health check cannot detect ANY of these conditions.
We had a Java service that hit a deadlock in its connection pool. The JVM was still running, the port was still open, TCP health checks passed perfectly. But every HTTP request hung for 30 seconds and then timed out. The L4 load balancer kept sending traffic to it for 45 minutes before someone noticed. Switching to an HTTP health check on /health that verified database connectivity would have caught it in seconds.
L7 Health Checks (HTTP)
An L7 health check sends a real HTTP request (usually GET /health or GET /healthz) and checks the response:
# What an L7 health check does (equivalent):
curl -s -o /dev/null -w "%{http_code}" http://10.244.1.5:8080/health
# 200
# Result: HEALTHY
curl -s -o /dev/null -w "%{http_code}" http://10.244.1.5:8080/health
# 500
# Result: UNHEALTHY — remove from rotation
# A well-designed /health endpoint checks:
# - Can the app connect to its database?
# - Can it reach required downstream services?
# - Is it ready to serve traffic (not still warming up)?
Always use L7 (HTTP) health checks for HTTP services. L4 (TCP) health checks only verify that the process is running and the port is open. L7 health checks verify that the application is actually functional. The small overhead of an HTTP health check (typically 1-5ms) is negligible compared to the cost of routing traffic to a broken backend.
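The difference is easy to reproduce locally. This sketch (Python standard library only; the port is ephemeral and the `/health` path is arbitrary) stands up a deliberately broken HTTP app and runs both styles of check against it:

```python
import socket
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# A "crashed" app: the process runs, the port accepts connections,
# but every request gets HTTP 500.
class BrokenApp(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(500)
        self.end_headers()

    def log_message(self, *args):  # keep the demo output quiet
        pass

server = HTTPServer(("127.0.0.1", 0), BrokenApp)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# L4-style check: can we complete a TCP handshake? Yes -> "healthy".
with socket.create_connection(("127.0.0.1", port), timeout=2):
    l4_healthy = True

# L7-style check: does GET /health return a 2xx? No -> unhealthy.
try:
    urllib.request.urlopen(f"http://127.0.0.1:{port}/health", timeout=2)
    l7_healthy = True
except urllib.error.HTTPError:
    l7_healthy = False

print("L4:", l4_healthy, "L7:", l7_healthy)  # L4: True L7: False
server.shutdown()
```

This is the opening scenario in miniature: the TCP check never sees the 500s.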
Health Check Configuration in Kubernetes
Kubernetes has three probe types that operate at the application level:
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: app
    ports:
    - containerPort: 8080
    # Liveness: is the container still running?
    # Failure → restart the container
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 10
      failureThreshold: 3
    # Readiness: is the container ready for traffic?
    # Failure → remove from Service endpoints (stop sending traffic)
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
      failureThreshold: 2
    # Startup: is the container still starting up?
    # Failure → restart the container (gives slow apps time to start)
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 5
      failureThreshold: 30 # 30 * 5s = 150s to start
Separate your liveness and readiness endpoints. The liveness probe answers "is this process broken beyond repair?" (restart it). The readiness probe answers "can this instance handle requests right now?" (stop sending traffic). A pod that is overloaded should fail readiness (remove from rotation) but NOT fail liveness (do not restart it — that makes things worse).
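A minimal probe server illustrating that separation — a sketch with a hypothetical `overloaded` flag; a real app would derive readiness from queue depth, connection pool saturation, or dependency checks:

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

state = {"overloaded": False}  # hypothetical in-process load signal

class Probes(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Liveness: the process can still serve HTTP at all.
            self.send_response(200)
        elif self.path == "/ready":
            # Readiness: shed traffic while overloaded, without
            # triggering a restart.
            self.send_response(503 if state["overloaded"] else 200)
        else:
            self.send_response(404)
        self.end_headers()

    def log_message(self, *args):
        pass

server = HTTPServer(("127.0.0.1", 0), Probes)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

def check(path):
    try:
        return urllib.request.urlopen(
            f"http://127.0.0.1:{port}{path}", timeout=2).status
    except urllib.error.HTTPError as e:
        return e.code

state["overloaded"] = True
alive, ready = check("/healthz"), check("/ready")
print(alive, ready)  # 200 503 -> keep the pod, stop sending it traffic
server.shutdown()
```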
Part 4: Connection Draining — The Graceful Shutdown Problem
When a pod is being terminated (scaling down, rolling update, node drain), what happens to active connections? If the load balancer immediately stops sending traffic AND drops existing connections, users see errors.
Connection draining (also called "graceful shutdown" or "deregistration delay") solves this:
- Mark the backend as "draining" — stop sending NEW connections
- Allow EXISTING connections to complete (up to a timeout)
- After the timeout, forcefully close remaining connections
- Remove the backend entirely
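Those four steps amount to a small state machine. A toy Python model (class name and timings are illustrative, not any real load balancer's implementation):

```python
import time

class Backend:
    """Toy model of a load balancer backend being drained."""

    def __init__(self, in_flight=0):
        self.state = "active"
        self.in_flight = in_flight

    def accept(self):
        # Step 1: a draining backend refuses NEW connections.
        if self.state != "active":
            raise ConnectionRefusedError("backend is draining")
        self.in_flight += 1

    def finish_one(self):
        self.in_flight -= 1

    def begin_drain(self, timeout_s, poll_s=0.01):
        self.state = "draining"
        # Step 2: let existing connections complete, up to the timeout.
        deadline = time.monotonic() + timeout_s
        while self.in_flight > 0 and time.monotonic() < deadline:
            time.sleep(poll_s)
        # Step 3: forcefully close whatever is left at the deadline.
        self.in_flight = 0
        # Step 4: remove the backend entirely.
        self.state = "removed"
```

With no in-flight requests the drain completes immediately; with requests still open it waits up to `timeout_s` before force-closing, which is exactly what `deregistration_delay` bounds on an ALB.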
# AWS ALB deregistration delay (Terraform)
resource "aws_lb_target_group" "app" {
  deregistration_delay = 30 # seconds to wait for in-flight requests
}

# Kubernetes: Pod termination grace period
apiVersion: v1
kind: Pod
spec:
  terminationGracePeriodSeconds: 30 # default, matches ALB
  containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          # Give kube-proxy time to update iptables rules
          # BEFORE the app starts shutting down
          command: ["sh", "-c", "sleep 5"]
There is a race condition in Kubernetes pod termination. When a pod is deleted, two things happen simultaneously: (1) the kubelet sends SIGTERM to the container, and (2) the Endpoints controller removes the pod from the Service. If the app shuts down before iptables rules are updated, in-flight requests from other pods hit a closed port. The preStop sleep (5-10 seconds) gives kube-proxy time to propagate the change before the app exits.
Part 5: When to Use L4 vs L7
The decision is not always obvious. Here is a practical framework:
Use L4 load balancing when:
- You need maximum throughput with minimum latency (every microsecond counts)
- The traffic is not HTTP (databases, message queues, custom TCP protocols, gRPC without path routing)
- You want TLS passthrough (the backend handles TLS, not the load balancer)
- You are load balancing across Availability Zones at the network level
Use L7 load balancing when:
- You need path-based or header-based routing (microservices behind one domain)
- You want the load balancer to terminate TLS (centralized certificate management)
- You need HTTP-level health checks (verify application health, not just port)
- You want advanced traffic management (canary deployments, A/B testing, rate limiting)
- You need request-level observability (HTTP status codes, latency per path)
In Kubernetes, you typically use both:
L4 + L7 in a Typical Kubernetes Setup
The NLB (L4) handles raw TCP traffic from the internet, distributes it across Ingress Controller pods, and provides a static IP. The Ingress Controller (L7) terminates TLS, inspects HTTP requests, and routes /api/* to the API service and / to the web service. Inside the cluster, kube-proxy (L4) distributes traffic from the ClusterIP to individual pods.
If you only have one backend service (no path routing needed), you can skip the Ingress Controller entirely and use a Service type LoadBalancer with an NLB. This removes a hop, reduces latency, and simplifies your stack. Only add an Ingress Controller when you actually need L7 routing.
Key Concepts Summary
- L4 load balancing sees only IPs and ports — it forwards TCP/UDP packets without inspecting content, adding microseconds of latency
- L7 load balancing terminates connections and inspects HTTP — it can route by path, header, and cookie, adding milliseconds of latency
- L4 health checks (TCP SYN) only verify the port is open — a crashed application that still accepts TCP connections will pass
- L7 health checks (HTTP GET) verify the application responds correctly — always use these for HTTP services
- Connection draining prevents dropped requests during pod termination — configure a `preStop` sleep and `terminationGracePeriodSeconds`
- Kubernetes uses both layers: NLB (L4) at the edge, Ingress Controller (L7) for routing, kube-proxy (L4) inside the cluster
- The preStop race condition is real — without a sleep, pods can receive traffic after they start shutting down
Common Mistakes
- Using L4 (TCP) health checks for HTTP services — the load balancer cannot detect application-level failures like 500 errors or deadlocks
- Forgetting the `preStop` sleep hook — leads to dropped connections during rolling updates because iptables rules update after the app starts shutting down
- Setting `terminationGracePeriodSeconds` too low — long-running requests get SIGKILL before they complete
- Using sticky sessions as a permanent solution instead of fixing stateful application design
- Assuming round robin is "fair" — if one backend is slower, it accumulates connections and becomes a bottleneck (use least connections instead)
- Skipping the Ingress Controller when you need path-based routing — trying to do L7 routing with multiple LoadBalancer Services wastes cloud load balancers and IPs
Your HTTP service passes its TCP health check but returns HTTP 503 for every request. What type of load balancer health check would catch this failure?