
Your gRPC Connection Worked for an Hour, Then Stopped. Welcome to Keepalive Hell.

gRPC connections silently die behind load balancers, NAT gateways, and idle timeouts. The keepalive settings that prevent this are documented separately on every side and you need all four to agree.

By Sharon Sahadevan · 10 min read

You ship a microservice that calls a gRPC backend. Everything works in load tests. In production, after about an hour of normal traffic, requests start failing with:

Error: rpc error: code = Unavailable desc = transport is closing
Error: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: i/o timeout"

The pattern: everything works, then a burst of failures, then it recovers. The gaps between bursts line up with suspiciously round numbers: 60 minutes, 120 minutes, 240 minutes. Ring any bells? It is the AWS NLB idle timeout (350 seconds by default) or your CNI's NAT timeout (3600 seconds by default) silently dropping the TCP connection while gRPC still thinks it is alive.

This is keepalive hell. gRPC has its own keepalive system; HTTP/2 has its own ping mechanism; load balancers have their own idle timers; cloud NAT gateways have theirs. Get any one wrong and your long-lived connections die in production in a way that load tests never reproduce.

This post is the production guide to gRPC keepalives: what each layer does, what the right values are, and the diagnostic that finds which layer is killing your connections.

Why gRPC connections die silently

A gRPC connection is a long-lived HTTP/2 connection (TCP underneath). HTTP/2 was designed for browser-server use where connections last seconds, not for backend-to-backend RPC where they last hours. Several intermediate layers can drop the connection without telling either end:

1. Load balancer idle timeout. AWS NLB defaults to 350 seconds. AWS ALB defaults to 60 seconds. GCP load balancers vary. After the timeout with no traffic, the LB closes the connection. The client and server may not get a FIN; the LB just stops forwarding.

2. NAT gateway / source NAT timeout. Cloud NAT gateways and Kubernetes' own kube-proxy iptables/IPVS NAT have idle connection tables. AWS NAT Gateway: 350 seconds for idle TCP. After timeout, the entry is evicted; subsequent packets get dropped because there is no NAT mapping.

3. Stateful firewall idle timeout. Same idea, different layer. Often 1 hour.

4. Connection-pooling middleware (Envoy, Istio sidecar). Service mesh sidecars have their own connection pools with idle timeouts. Default in Envoy: connection_idle_timeout = 1 hour.

When the connection is killed by an intermediary, neither client nor server sees the close. They both think the connection is fine. The client sends the next RPC; the request times out or gets RST'd; gRPC reports Unavailable.

The fix: keepalives

A keepalive is a no-op packet sent periodically to keep the connection "active" from every intermediary's perspective. As long as packets flow more often than the shortest idle timeout in the path, no intermediary thinks the connection is idle.

gRPC has two layers of keepalive:

TCP keepalive: kernel-level. Sends an empty TCP segment when idle. Off by default; turned on with SO_KEEPALIVE socket option, controlled by net.ipv4.tcp_keepalive_time (default 7200s = 2 hours).

HTTP/2 PING frame: gRPC-level. Sends an HTTP/2 PING frame when the connection is idle. Configured with gRPC's keepalive options.

Almost everyone uses HTTP/2 PING (the gRPC way) instead of TCP keepalive, because:

  • The TCP keepalive defaults are too long (2 hours).
  • Tuning TCP keepalive system-wide means touching sysctls (root); per-socket overrides exist but are runtime-specific and easy to miss (see the sketch below).
  • HTTP/2 PING is application-level and works through HTTP/2-aware proxies.
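
For contrast, here is what per-socket TCP keepalive looks like in Go. This is a minimal sketch, not something from the post: it avoids sysctls and root, but it only keeps the kernel's view of one TCP hop alive, so an HTTP/2-aware proxy in the middle still sees an idle stream.

import (
    "net"
    "time"
)

// Per-socket TCP keepalive: no sysctls, no root. The kernel sends a
// keepalive probe after 30s of idleness on this connection only.
d := net.Dialer{
    KeepAlive: 30 * time.Second,
}
conn, err := d.Dial("tcp", "my-service:50051")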

The seven gRPC keepalive options

gRPC has seven knobs you need to know, three on the client and four on the server:

On the client:

  • keepalive_time_ms: how often to send a PING when no other traffic is flowing. Default: infinity (no keepalive). Set it to something smaller than the shortest idle timeout in your network path.
  • keepalive_timeout_ms: how long to wait for a PING ACK before considering the connection dead. Default: 20 seconds.
  • keepalive_permit_without_calls: allow PINGs even when no RPCs are in flight. Default: false. Set this to true for backend services with periodic-but-quiet load.

On the server:

  • permit_keepalive_without_calls: allow clients to send PINGs even with no RPCs. Default: false. Set this to true if your clients use keepalive_permit_without_calls=true (otherwise the server rejects their PINGs as "too many pings").
  • permit_keepalive_time_ms: minimum interval the server permits between client PINGs. Default: 5 minutes. Lower this if your client PING interval is shorter.
  • max_connection_idle_ms: server-side timeout for idle connections (server proactively closes). Default: infinity.
  • max_connection_age_ms: maximum lifetime of any connection (server kicks clients to redistribute load). Default: infinity. Set this to ~30 minutes for load-balanced setups so that connections rotate across new backend pods.

If you do not configure both sides, the server will silently reject your client's keepalives with "ENHANCE_YOUR_CALM: too_many_pings" and forcibly close the connection.

The values that actually work in Kubernetes

For a typical Kubernetes setup with AWS NLB and kube-proxy:

Client (Go example):

import "google.golang.org/grpc"
import "google.golang.org/grpc/keepalive"

conn, err := grpc.NewClient(
    "my-service:50051",
    grpc.WithTransportCredentials(creds),
    grpc.WithKeepaliveParams(keepalive.ClientParameters{
        Time:                10 * time.Second,  // ping every 10s when idle
        Timeout:             3 * time.Second,   // wait 3s for ACK
        PermitWithoutStream: true,              // ping even with no RPCs
    }),
)

Server (Go example):

server := grpc.NewServer(
    grpc.KeepaliveParams(keepalive.ServerParameters{
        MaxConnectionIdle:     15 * time.Minute, // close idle clients eventually
        MaxConnectionAge:      30 * time.Minute, // rotate connections for LB
        MaxConnectionAgeGrace: 5 * time.Second,  // grace before forcing close
        Time:                  10 * time.Second, // ping every 10s
        Timeout:               3 * time.Second,
    }),
    grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
        MinTime:             5 * time.Second, // permit client pings as fast as 5s
        PermitWithoutStream: true,
    }),
)

Why these numbers:

  • 10s ping interval: shorter than any reasonable idle timeout (NLB 350s, NAT 350s, mesh 1h). Plenty of safety margin.
  • 5s MinTime on the server: half the client's 10s interval, so normal client pings are always permitted, with headroom for retries and clock skew.
  • 30 minutes max age: forces clients to reconnect periodically. Critical for load balancing: without this, gRPC's connection-stickiness means new server pods get no traffic.

The MaxConnectionAge gotcha

gRPC's load balancing model is "one persistent connection per client per backend." If a server pod is added later, existing clients never connect to it because their existing connection is still alive.

MaxConnectionAge on the server fixes this by forcing clients to reconnect periodically. When a client reconnects, gRPC's resolver picks a new backend (often a new pod). Without this, traffic distribution stays unbalanced indefinitely.

For Kubernetes Services backed by Deployments that scale up and down, set this to 5-30 minutes. For Services that are stable (a fixed-size StatefulSet), longer is fine.
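
When MaxConnectionAge fires, the server sends a GOAWAY and the client reconnects; an RPC issued in that window can fail fast with Unavailable. One client-side mitigation (a sketch, not something the post prescribes) is to mark idempotent calls wait-for-ready so they block, up to their deadline, instead of failing while the new connection comes up. GetThing and pb.GetThingRequest below are stand-ins for your generated stub.

ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
defer cancel()

// WaitForReady makes the RPC queue until the channel is ready again
// (or the deadline expires) rather than failing fast with Unavailable.
resp, err := client.GetThing(ctx, &pb.GetThingRequest{Id: "42"}, grpc.WaitForReady(true))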

Diagnosing "transport is closing" in production#

Step 1: confirm it really is keepalive.

# On the client pod, list established connections to the server
ss -tnp | grep $SERVER_IP

# On the server, count established connections (with MaxConnectionAge set, these rotate over time)
kubectl exec -it $SERVER_POD -- ss -tn state established | wc -l

If connections go bad after exactly one idle timeout's worth of silence (350 seconds for an NLB), the LB is killing them.

Step 2: look at the gRPC transport logs.

# Enable gRPC tracing (verbose)
GRPC_GO_LOG_VERBOSITY_LEVEL=99 GRPC_GO_LOG_SEVERITY_LEVEL=info \
  ./your-app

Look for transport.go: closing connection and client transport: GOAWAY messages. GOAWAY from server with code "ENHANCE_YOUR_CALM" means your client pings are too aggressive for the server's MinTime.

Step 3: tcpdump for the smoking gun.

# Capture HTTP/2 frames (port 50051 is gRPC)
kubectl debug -it node/$NODE --image=nicolaka/netshoot
$ tcpdump -i any -n -X 'port 50051'

Look for PING frames (HTTP/2 frame type 0x06). What you see tells you where things stand:

  • PINGs flowing both directions: keepalive is working.
  • PINGs from client, no ACK from server: server is overloaded or the connection is already broken.
  • No PINGs at all: keepalive is not configured.

Common production mistakes

1. Keepalive set on client only. Server defaults reject "too many pings" and disconnect. Both sides need configuration.

2. PING interval longer than the shortest idle timeout. Common: setting Time to several minutes (to keep ping overhead down) when the NLB or NAT idle timeout is 350 seconds. It works in load tests, where traffic never pauses that long, and fails in production after a long enough quiet period. Set the interval to a fraction of the shortest timeout (1/10 to 1/30); behind a 350-second NLB that means roughly 10-35 seconds.

3. PermitWithoutStream: false on backend services. Backend gRPC services often have idle periods between traffic spikes; without PermitWithoutStream: true, no keepalives are sent during quiet periods, and the connection is killed before the next spike.

4. MaxConnectionAge not set on servers behind a load balancer. Connections never rotate; new pods get zero traffic. Set this for any server behind a Service.

5. Sidecar mesh defaults overriding gRPC settings. Istio's Envoy has its own connection_idle_timeout. If it is shorter than your gRPC keepalive interval, Envoy kills the connection from inside the cluster. Tune Envoy too:

apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: my-service
spec:
  host: my-service
  trafficPolicy:
    connectionPool:
      tcp:
        connectTimeout: 5s
        tcpKeepalive:
          time: 30s
          interval: 5s
      http:
        h2UpgradePolicy: UPGRADE
        idleTimeout: 1h

6. Client retries that hide the problem. A client with aggressive retries on Unavailable masks the keepalive issue from monitoring. The retries succeed (on a new connection), so user-facing latency is fine, but you are paying connection-setup latency on every spike. Add a metric for "RPC failed with Unavailable" separate from "RPC succeeded after retry" to see the leading indicator.
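
One way to get that leading indicator, sketched with a Prometheus counter (the metric name, labels, and interceptor placement are illustrative, not from the post):

import (
    "context"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "google.golang.org/grpc"
    "google.golang.org/grpc/status"
)

// rpcFinished counts RPC outcomes by method and final status code.
var rpcFinished = promauto.NewCounterVec(prometheus.CounterOpts{
    Name: "grpc_client_rpc_finished_total",
    Help: "Client RPC attempts by method and status code.",
}, []string{"method", "code"})

// countCodes is a unary client interceptor. Register it inside (after) any
// retry interceptor so each failed attempt is counted, not just the final result.
func countCodes(ctx context.Context, method string, req, reply any,
    cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
    err := invoker(ctx, method, req, reply, cc, opts...)
    rpcFinished.WithLabelValues(method, status.Code(err).String()).Inc()
    return err
}

Wire it in with grpc.WithChainUnaryInterceptor so it sits after the retry layer, and alert on the Unavailable rate even while retried calls keep succeeding.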

Cloud-specific timeouts to know

Quick reference for the timeouts that bite:

| Component             | Default idle timeout    | Configurable?                              |
|-----------------------|-------------------------|--------------------------------------------|
| AWS NLB (TCP)         | 350s                    | No (fixed)                                 |
| AWS ALB (HTTP)        | 60s                     | Yes (1-4000s)                              |
| AWS NAT Gateway       | 350s (TCP idle)         | No                                         |
| GCP Network LB        | varies                  | Yes                                        |
| GCP Cloud NAT         | varies (10 min default) | Yes                                        |
| Azure LB              | 4 minutes               | Yes (4-30 min)                             |
| Linux conntrack       | 5 days                  | Yes (nf_conntrack_tcp_timeout_established) |
| iptables/IPVS NAT     | 15 min                  | Yes                                        |
| Envoy default idle    | 1 hour                  | Yes                                        |
| Istio sidecar default | 1 hour                  | Yes                                        |

The shortest of these on your path determines your keepalive interval.
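
As a worked example (the 350s figure comes from the table above; dividing by 10 is the post's rule of thumb):

shortest := 350 * time.Second // AWS NLB / NAT Gateway idle timeout on this path
interval := shortest / 10     // 35s; the post's examples use 10s for extra margin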

Quick reference: the gRPC-down-after-an-hour checklist

1. Confirm the pattern: failures are bursty, with gaps matching a known idle timeout
   - If gap is ~6 minutes: AWS NLB
   - If gap is ~1 hour: NAT or sidecar
   - If gap is exactly 5 days: conntrack

2. Check current keepalive config:
   - Client: grpc.WithKeepaliveParams set?
   - Server: grpc.KeepaliveParams + KeepaliveEnforcementPolicy set?

3. tcpdump for HTTP/2 PING frames:
   - Both directions = keepalive working
   - No pings = misconfigured
   - PING from client + GOAWAY from server = ENHANCE_YOUR_CALM rejection

4. Set both sides:
   - Client: Time=10s, Timeout=3s, PermitWithoutStream=true
   - Server: matching MinTime=5s, MaxConnectionAge=30m

5. If using a service mesh, configure mesh-side too:
   - Istio DestinationRule with idleTimeout: 1h
   - Linkerd's defaults are usually OK

6. Set up alerts:
   - rate(grpc_client_handled_total{grpc_code="Unavailable"}[5m]) > 0
   - Track p99 connection age vs MaxConnectionAge

The mental model

Long-lived TCP connections live or die based on the shortest idle timeout in their path. Cloud LBs, NAT, sidecars, and conntrack tables each have their own. gRPC keepalives work by ensuring the connection is never idle long enough for any of them to give up.

The default of "no keepalive" is unsafe for any backend gRPC connection that lasts more than a few minutes. Both sides must be configured. The values are not magic: pick a ping interval shorter than the shortest path timeout, divide by 10 for safety margin, set MaxConnectionAge on the server so connections rotate.

After you set this up correctly once, gRPC stops being mysterious. Connections live as long as you want; load balancing actually balances; "transport is closing" goes away.


The full networking layer (TCP, conntrack, HTTP/2, gRPC, service mesh, kube-proxy) is covered in the Networking Fundamentals course. The Kubernetes-specific debugging patterns (tcpdump in pods, kube-proxy modes, mesh inspection) are part of the Kubernetes Debugging course.