TCP — The Three-Way Handshake & Connection Lifecycle
It is 2 AM and your pager fires. A critical microservice is unreachable. You SSH into the pod's node and run
telnet api-service 8080. It hangs. No connection refused, no timeout message — just silence. Is the service down? Is a firewall blocking traffic? Is TCP even reaching the server? You cannot answer any of these questions unless you understand how TCP connections actually work — what happens at the packet level when two machines try to talk.
This lesson gives you that understanding. By the end, you will be able to look at
ss output and immediately know whether a connection is stuck, half-open, or leaking.
Part 1: The Three-Way Handshake — SYN, SYN-ACK, ACK
Every TCP connection begins with a three-way handshake. No data flows until this handshake completes. No exceptions.
Why a Handshake Exists
TCP is a connection-oriented protocol. Unlike UDP, which fires packets blindly, TCP needs both sides to agree on initial parameters before any data is exchanged. The handshake establishes:
- Sequence numbers — each side picks a random starting number so they can track packet order
- Window size — how much data each side can buffer before needing an acknowledgment
- Maximum segment size (MSS) — the largest chunk of data that fits in one TCP segment
- Options — timestamps, selective acknowledgment (SACK), window scaling
The Three Steps
Here is what happens when your application calls connect() to a remote server:
Step 1: SYN (Client to Server)
The client sends a TCP segment with the SYN flag set. This segment contains:
- Source port (ephemeral, e.g., 52431)
- Destination port (e.g., 8080)
- Initial sequence number (ISN) — a random 32-bit number (e.g., 1000)
- Window size — how much data the client can receive
- MSS option — typically 1460 bytes on Ethernet
Step 2: SYN-ACK (Server to Client)
If the server is listening on port 8080, it responds with both SYN and ACK flags set:
- Acknowledges the client's sequence number (ACK = client ISN + 1, e.g., 1001)
- Sends its own initial sequence number (e.g., 5000)
- Its own window size and MSS
Step 3: ACK (Client to Server)
The client acknowledges the server's sequence number (ACK = server ISN + 1, e.g., 5001). The connection is now ESTABLISHED. Data can flow.
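From the application side, the three steps are invisible but their completion is not: connect() does not return until the handshake finishes. A minimal Python sketch using a throwaway local listener (the loopback address and port-0 trick are just for illustration):

```python
import socket

# Minimal sketch: connect() returns only once SYN, SYN-ACK, ACK complete.
# A local listener stands in for the server; port 0 means "any free port".
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)                       # enter LISTEN; the kernel now answers SYNs
port = server.getsockname()[1]

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))    # blocks for exactly one handshake (1 RTT)

# The kernel completed the handshake from its listen queue: the connection
# is already ESTABLISHED before the application even calls accept().
conn, _ = server.accept()
peer = client.getpeername()
print(peer)                            # ('127.0.0.1', <ephemeral server port>)

for s in (client, conn, server):
    s.close()
```

Note that accept() returned instantly: the kernel had already finished the handshake on the application's behalf, which is exactly why a full listen backlog (queued, completed handshakes) causes connection problems.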
TCP Three-Way Handshake
The three-way handshake is not optional, not skippable, and not negotiable. Every single TCP connection — whether it carries one HTTP request or a million database queries — starts with SYN, SYN-ACK, ACK. This takes one full round-trip time (RTT). If your RTT to a service is 100ms, every new connection adds 100ms of latency before any data moves. This is why connection pooling and keep-alive matter enormously in distributed systems.
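The cost compounds quickly. A back-of-envelope calculation with illustrative numbers (a 100ms RTT and 50 requests, both assumptions for the example):

```python
# Handshake tax for a request fan-out, illustrative numbers only.
rtt_ms = 100       # round-trip time to the service
requests = 50      # requests made to that service

no_pooling = requests * rtt_ms    # one fresh handshake per request
with_pooling = 1 * rtt_ms         # one handshake, then reuse the connection

print(no_pooling, with_pooling)   # prints 5000 100
```

Five full seconds spent on handshakes alone versus a tenth of a second: that is the entire argument for connection pooling in one subtraction.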
Watching the Handshake with tcpdump
You can see the handshake in real time:
# Capture the handshake to port 8080
sudo tcpdump -i eth0 -nn port 8080 -c 3
# Output:
# 10:00:00.000 IP 10.0.1.5.52431 > 10.0.2.10.8080: Flags [S], seq 1000, win 65535, options [mss 1460], length 0
# 10:00:00.001 IP 10.0.2.10.8080 > 10.0.1.5.52431: Flags [S.], seq 5000, ack 1001, win 65535, options [mss 1460], length 0
# 10:00:00.001 IP 10.0.1.5.52431 > 10.0.2.10.8080: Flags [.], ack 5001, win 65535, length 0
The flags tell the story: [S] is SYN, [S.] is SYN-ACK (the dot means ACK), [.] is ACK only. If you only see the first [S] packet repeated over and over with no [S.] response, the SYN is being dropped — likely by a firewall or security group.
When debugging connection issues, tcpdump on the destination host is your best friend. If you see SYN packets arriving but no SYN-ACK leaving, the problem is on the server side (service not listening, iptables DROP rule, or the application's listen backlog is full). If you do not even see SYN packets arriving, the problem is in the network path (firewall, security group, routing).
Part 2: Sequence Numbers and Reliable Delivery
TCP's killer feature is reliable, ordered delivery. Every byte sent through a TCP connection has a sequence number, and the receiver acknowledges each byte. If a segment is lost, the sender retransmits it.
How Sequence Numbers Work
After the handshake, the client's sequence number starts at ISN + 1 (e.g., 1001). When the client sends 500 bytes of data, the segment has:
- Sequence number: 1001
- Data length: 500
The next segment will have sequence number 1501. The receiver sends back an ACK with number 1501, meaning "I have received everything up to byte 1501, send me the next one."
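The byte accounting above can be checked with a few lines of arithmetic (values taken from the running example; the modulo reflects the 32-bit sequence space):

```python
# Sequence numbers count bytes, modulo 2**32.
ISN = 1000                                # client's initial sequence number
first_data_seq = (ISN + 1) % 2**32        # the SYN itself consumes one number
after_500 = (first_data_seq + 500) % 2**32  # next segment after 500 data bytes

print(first_data_seq, after_500)          # prints 1001 1501
```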
# See sequence numbers in action
sudo tcpdump -i eth0 -nn port 8080 -S
# -S shows absolute sequence numbers (not relative)
# 10.0.1.5.52431 > 10.0.2.10.8080: seq 1001:1501, ack 5001, length 500
# 10.0.2.10.8080 > 10.0.1.5.52431: ack 1501, length 0
# 10.0.1.5.52431 > 10.0.2.10.8080: seq 1501:2001, ack 5001, length 500
Sequence numbers are 32-bit unsigned integers (0 to 4,294,967,295). On a 10 Gbps link, these numbers wrap around in about 3.4 seconds. TCP handles this with PAWS (Protection Against Wrapped Sequence numbers), which relies on the timestamp option. If you see mysterious connection resets on high-throughput links, check that TCP timestamps are enabled (net.ipv4.tcp_timestamps = 1).
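The 3.4-second figure follows directly from the sizes involved:

```python
# Why PAWS matters: time to exhaust the 32-bit sequence space at line rate.
SEQ_SPACE = 2 ** 32           # bytes before the counter wraps
link_bps = 10 * 10 ** 9       # 10 Gbps link
wrap_seconds = SEQ_SPACE / (link_bps / 8)   # bits per second -> bytes per second

print(round(wrap_seconds, 2))  # prints 3.44
```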
Retransmission: What Happens When Packets Are Lost
When a sender does not receive an ACK within the retransmission timeout (RTO), it resends the segment. The initial RTO is calculated from the observed round-trip time (RTT) using an algorithm called Karn/Partridge with Jacobson/Karels smoothing. In practice:
- First retransmit: after ~200ms (depends on RTT)
- Second retransmit: ~400ms (doubles each time — exponential backoff)
- Third retransmit: ~800ms
- After 15 retries (default on Linux): the connection is abandoned (~15 minutes total)
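A rough model of this schedule, assuming a 200ms initial RTO and the Linux 120-second RTO cap (both vary with measured RTT in practice), reproduces the timeouts quoted in this lesson:

```python
# Rough model of TCP's retransmit schedule: the RTO doubles per attempt,
# capped at 120 s on Linux. Initial RTO of 200 ms is an assumption here.
def time_to_abandon(retries2, initial_rto=0.2, rto_max=120.0):
    total, rto = 0.0, initial_rto
    # The sender waits one RTO after the initial send, then one (doubled)
    # RTO after each retransmit, hence retries2 + 1 intervals in total.
    for _ in range(retries2 + 1):
        total += rto
        rto = min(rto * 2, rto_max)
    return total

print(round(time_to_abandon(15)))  # prints 925: roughly 15 minutes (the default)
print(round(time_to_abandon(5)))   # prints 13: the ~13 seconds quoted below
```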
# Check retransmission settings
sysctl net.ipv4.tcp_retries2
# net.ipv4.tcp_retries2 = 15
# Monitor retransmissions in real time
ss -ti | grep retrans
# cubic rto:204 ... retrans:0/3
We had a Kubernetes cluster where pods talking to an external database would randomly hang for exactly 15 minutes before timing out. The database was behind a stateful firewall that silently dropped idle connections after 10 minutes. TCP would retransmit 15 times with exponential backoff, taking roughly 15 minutes before giving up. The fix was setting net.ipv4.tcp_retries2 = 5 on the nodes (about 13 seconds max) and configuring TCP keepalive on the database connection pool. Do not rely on the default 15 retries in production.
Part 3: TCP Connection States
Every TCP connection exists in one of several states. Understanding these states is critical for debugging connection problems in production.
The Full State Machine
| State | Description | Who is in this state |
|---|---|---|
| LISTEN | Waiting for incoming connections | Server |
| SYN_SENT | SYN sent, waiting for SYN-ACK | Client |
| SYN_RECEIVED | SYN-ACK sent, waiting for final ACK | Server |
| ESTABLISHED | Connection is open, data flowing | Both |
| FIN_WAIT_1 | Sent FIN, waiting for ACK | Closer (active close) |
| FIN_WAIT_2 | Got ACK of FIN, waiting for remote FIN | Closer |
| CLOSE_WAIT | Got FIN, waiting for application to close | Remote side |
| LAST_ACK | Sent FIN (after CLOSE_WAIT), waiting for ACK | Remote side |
| TIME_WAIT | Both FINs exchanged, waiting before final close | Closer |
| CLOSED | Connection is fully closed | Both |
TCP Connection States — The Full Lifecycle
The States That Cause Problems
TIME_WAIT — The Most Misunderstood State
When the side that initiates the close (sends the first FIN) completes the four-way teardown, it enters TIME_WAIT for 2x the Maximum Segment Lifetime (MSL). On Linux, this is 60 seconds (2 x 30s).
Why does TIME_WAIT exist? Two reasons:
- Ensure the final ACK was received. If the remote side's FIN was lost and retransmitted, the closing side needs to be around to re-ACK it.
- Prevent old packets from corrupting new connections. If a new connection immediately reuses the same source IP, source port, destination IP, destination port tuple, delayed packets from the old connection could be misinterpreted.
# Count connections in TIME_WAIT
ss -s
# TCP: 1542 (estab 340, closed 200, orphaned 0, timewait 982)
# ^^^^^^^^^
# 982 connections waiting to die. Each holds a port.
# List TIME_WAIT connections
ss -tan state time-wait
TIME_WAIT is not a bug. It is a safety mechanism. But in high-throughput environments (like a K8s node proxying thousands of requests per second), TIME_WAIT accumulation can exhaust ephemeral ports. The fix is NOT to reduce TIME_WAIT duration — it is to use connection pooling and keep-alive so you make fewer connections in the first place.
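A quick calculation shows why exhaustion happens, assuming the default Linux ephemeral range (net.ipv4.ip_local_port_range = 32768 60999):

```python
# TIME_WAIT as a rate limit on fresh connections to a single destination.
ephemeral_ports = 60999 - 32768 + 1   # 28232 usable source ports (Linux default)
time_wait_seconds = 60                # how long each closed connection holds one

max_new_conns_per_sec = ephemeral_ports / time_wait_seconds
print(round(max_new_conns_per_sec))   # prints 471
```

Open connections faster than roughly 470 per second to one destination IP:port without pooling, and connect() starts failing with address-in-use errors long before the network itself is a bottleneck.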
CLOSE_WAIT — The Leak Detector
CLOSE_WAIT means: "The remote side sent FIN (it is done), but your application has not called close() on the socket yet." If you see connections stuck in CLOSE_WAIT, it means your application has a connection leak — it is not closing sockets after the remote side disconnects.
# Find CLOSE_WAIT connections — these are YOUR app leaking
ss -tan state close-wait
# State Recv-Q Send-Q Local Address:Port Peer Address:Port
# CLOSE-WAIT 0 0 10.0.1.5:52431 10.0.2.10:5432
# CLOSE-WAIT 0 0 10.0.1.5:52432 10.0.2.10:5432
# CLOSE-WAIT 0 0 10.0.1.5:52433 10.0.2.10:5432
# ^^^ Three leaked database connections
A growing count of CLOSE_WAIT connections is almost always an application bug. The remote side has closed its end, but your application is holding the socket open — probably because it lost track of the connection object. This eventually leads to file descriptor exhaustion and your application refusing new connections with "Too many open files." Fix the application code, not the kernel settings.
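From the application's point of view, the remote FIN shows up as recv() returning an empty read. A minimal Python sketch with a throwaway local socket pair (addresses illustrative):

```python
import socket

# Sketch: a peer's FIN surfaces as recv() returning b"". Not calling
# close() after that empty read is what leaves sockets in CLOSE_WAIT.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(server.getsockname())
conn, _ = server.accept()

conn.close()                  # remote side sends FIN
data = client.recv(1024)      # the EOF that signals the FIN arrived
print(data == b"")            # True: the peer is done with this connection

client.close()                # the fix: always close() after reading EOF
server.close()
```

Between the empty recv() and the close(), the client socket sits in CLOSE_WAIT; an application that drops the socket object without closing it stays there until the process exits.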
Part 4: TCP Keepalive — Detecting Dead Connections
TCP keepalive is a mechanism for detecting dead connections when no data is flowing. After a period of silence, the OS sends a tiny probe packet. If no response comes after several retries, the connection is considered dead.
The Default Settings (and Why They Are Wrong for K8s)
# Linux TCP keepalive defaults
sysctl net.ipv4.tcp_keepalive_time
# net.ipv4.tcp_keepalive_time = 7200 (wait 2 HOURS before first probe)
sysctl net.ipv4.tcp_keepalive_intvl
# net.ipv4.tcp_keepalive_intvl = 75 (75 seconds between probes)
sysctl net.ipv4.tcp_keepalive_probes
# net.ipv4.tcp_keepalive_probes = 9 (give up after 9 failed probes)
With defaults: if a remote pod dies silently (no FIN sent — think kill -9 or node failure), it takes 2 hours + 9 x 75 seconds = 2 hours 11 minutes before your application even knows the connection is dead. In a Kubernetes environment where pods restart in seconds, this is absurd.
# Reasonable keepalive settings for Kubernetes
sysctl -w net.ipv4.tcp_keepalive_time=30 # First probe after 30 seconds idle
sysctl -w net.ipv4.tcp_keepalive_intvl=10 # Probe every 10 seconds
sysctl -w net.ipv4.tcp_keepalive_probes=3 # Give up after 3 failed probes
# Total detection time: 30 + (3 × 10) = 60 seconds
Most application-level connection pools (database pools, HTTP clients, gRPC channels) have their own keepalive mechanisms that are faster and more reliable than TCP keepalive. Always configure application-level keepalive first. Use TCP keepalive as a safety net, not as the primary dead-connection detector. For gRPC in K8s, set the client channel's keepalive time to 10 seconds via the grpc.keepalive_time_ms channel argument.
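When you cannot change node-wide sysctls, keepalive can also be set per socket before connecting. A Python sketch using the Linux-specific socket options (the values mirror the sysctls above; option names differ on other platforms):

```python
import socket

# Per-socket keepalive overrides (Linux socket options), scoped to this
# one connection instead of the whole node.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 30)   # idle secs before first probe
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # secs between probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)     # failed probes before reset

idle = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE)
print(idle)  # prints 30: dead peers now detected in ~30 + 3*10 = 60 seconds
sock.close()
```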
Part 5: Connection Reset (RST) — The Emergency Stop
A TCP RST (reset) immediately tears down a connection without the graceful four-way close. No FIN, no waiting — just "this connection is dead, right now."
When RSTs Happen
- Connecting to a port with no listener. If nothing is listening on port 8080, the kernel responds to SYN with RST. This is the "Connection refused" error.
- Application crashes. If a process crashes without closing its sockets, the kernel sends RST to all connected peers.
- Firewall intervention. Some firewalls send RST to actively reject connections (instead of silently dropping).
- Half-open connection detected. If one side thinks the connection is alive but the other has forgotten about it (after a reboot, for example), the first data packet triggers an RST.
- TCP resource limits. If the listen backlog is full, some configurations respond with RST (on Linux, when net.ipv4.tcp_abort_on_overflow is set; the default is to drop silently).
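The first case is easy to reproduce: a SYN to a port with no listener comes back as RST, which the OS surfaces as "connection refused". A Python sketch; the bind-and-release below is just a way to find a local port that has no listener:

```python
import socket

# Find a port that nothing is listening on: bind to port 0 (kernel picks a
# free port), then release it without ever calling listen().
probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
probe.bind(("127.0.0.1", 0))
unused_port = probe.getsockname()[1]
probe.close()

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
refused = False
try:
    sock.connect(("127.0.0.1", unused_port))
except ConnectionRefusedError:
    refused = True        # the kernel answered our SYN with RST
finally:
    sock.close()

print(refused)            # prints True
```

Contrast this with a firewall silently dropping the SYN: there the same connect() would hang until it times out, with no exception for seconds or minutes.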
# Filter for RST packets in tcpdump
sudo tcpdump -i eth0 -nn 'tcp[tcpflags] & (tcp-rst) != 0' -c 5
# In ss output, look for connections that vanish suddenly
# (they won't show up — RST closes them immediately)
A team migrated their API gateway to a new Kubernetes cluster and suddenly 5% of requests failed with "connection reset by peer." The root cause: the new cluster had a Network Policy that allowed traffic on port 443 but the backend pods were listening on port 8443. The gateway's health checks passed (they checked a different port), but actual traffic hit port 8443, got RST from the kernel (no listener), and the gateway reported it as "connection reset." The fix was a two-line change to the Network Policy. Always check that your allowed ports match your actual listening ports.
Part 6: Window Size and Flow Control
TCP flow control prevents a fast sender from overwhelming a slow receiver. Each side advertises a receive window — the amount of data it can buffer.
How It Works
- The receiver advertises its window size in every ACK: "I can accept 65,535 more bytes."
- The sender never sends more than the advertised window without getting an ACK.
- If the receiver is slow (maybe its application is busy), it advertises a smaller window.
- If the window hits zero, the sender pauses. This is called zero-window — a sign that the receiver is overwhelmed.
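You can provoke the sender-side stall locally. The sketch below shrinks the socket buffers (sizes are illustrative), never reads on the receiving side, and uses a non-blocking sender so the stall surfaces as an exception instead of a silent hang:

```python
import socket

# Flow control closing the pipe: small buffers, a receiver that never
# reads, and a non-blocking sender that hits the wall once the receive
# window plus local buffering is full.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)

sender = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sender.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4096)
sender.connect(server.getsockname())
receiver, _ = server.accept()
receiver.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4096)

sender.setblocking(False)
sent = 0
blocked = False
try:
    while sent < 50_000_000:           # far more than the buffers can hold
        sent += sender.send(b"x" * 4096)
except BlockingIOError:
    blocked = True                     # window and buffers full: sender must wait

print(blocked, sent)                   # True, and sent is well short of 50 MB

for s in (sender, receiver, server):
    s.close()
```

With a blocking socket, that same condition shows up as a write() or send() that simply never returns, which is how a zero-window receiver stalls an entire upstream service.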
# Detect zero-window situations
ss -ti dst 10.0.2.10 | grep -i "rcv_space\|snd_wnd"
# snd_wnd:0 indicates the remote side has a zero window
# In tcpdump
sudo tcpdump -i eth0 -nn port 8080 | grep "win 0"
# 10.0.2.10.8080 > 10.0.1.5.52431: Flags [.], ack 100001, win 0, length 0
If you see zero-window conditions in your connections, the bottleneck is the receiver, not the network. The receiver's application is not reading data from the socket fast enough. Common causes in K8s: CPU throttling on the receiving pod (increase CPU limits), garbage collection pauses in Java applications, or a single-threaded event loop that is blocked.
Part 7: Inspecting Connections with ss and netstat
The ss command (socket statistics) is the modern replacement for netstat. It is faster and provides more detail.
# Show all TCP connections with state
ss -tan
# State Recv-Q Send-Q Local Address:Port Peer Address:Port
# LISTEN 0 128 0.0.0.0:8080 0.0.0.0:*
# ESTAB 0 0 10.0.1.5:52431 10.0.2.10:5432
# TIME-WAIT 0 0 10.0.1.5:52432 10.0.2.10:5432
# Count connections by state
ss -tan | tail -n +2 | awk '{print $1}' | sort | uniq -c | sort -rn
# 340 ESTAB
# 982 TIME-WAIT
# 12 CLOSE-WAIT
# 3 FIN-WAIT-2
# 1 LISTEN
# Show connections to a specific destination
ss -tan dst 10.0.2.10
# Show listening sockets with the process name
ss -tlnp
# LISTEN 0 128 0.0.0.0:8080 0.0.0.0:* users:(("nginx",pid=1234,fd=6))
# Show detailed TCP info (RTT, retransmissions, window)
ss -ti dst 10.0.2.10
# ESTAB 0 0 10.0.1.5:52431 10.0.2.10:5432
# cubic rto:204 rtt:1.5/0.5 mss:1448 rcv_space:14480 send 77.2Mbps
In a Kubernetes debugging session, run ss -tan | awk '{print $1}' | sort | uniq -c | sort -rn inside the pod or on the node. If you see hundreds of TIME_WAIT connections to the same destination, your application is creating new TCP connections for every request instead of using connection pooling. If you see growing CLOSE_WAIT, your application is leaking connections. This single command tells you more about connection health than most dashboards.
Key Concepts Summary
- The three-way handshake (SYN, SYN-ACK, ACK) is required for every TCP connection and costs one full round-trip time of latency
- Sequence numbers enable reliable, ordered delivery — every byte is tracked and acknowledged
- Retransmission handles packet loss with exponential backoff — default 15 retries over ~15 minutes on Linux
- TCP connection states tell you exactly what is happening: ESTABLISHED is healthy, TIME_WAIT is normal cleanup, CLOSE_WAIT is a leak
- TIME_WAIT exists to prevent packet confusion between old and new connections on the same port tuple — do not disable it, use connection pooling instead
- CLOSE_WAIT means your application is not closing sockets — this is always an application bug
- TCP keepalive defaults (2 hours) are far too slow for Kubernetes — configure 30-60 second detection
- RST (reset) is an immediate teardown — "Connection refused" means RST in response to SYN
- Zero-window means the receiver is overwhelmed — the bottleneck is the application, not the network
- ss is your primary tool for inspecting connection states on Linux
Common Mistakes
- Trying to eliminate TIME_WAIT by setting net.ipv4.tcp_tw_reuse without understanding the consequences — this can cause data corruption on connections to the same destination
- Ignoring CLOSE_WAIT connections — they indicate application-level connection leaks that will eventually exhaust file descriptors
- Relying on TCP keepalive defaults (2 hours) in Kubernetes — pods restart in seconds, but dead connections linger for hours
- Blaming the network when telnet hangs — a hanging telnet means SYN packets are being silently dropped, which is a firewall or security group issue, not a network outage
- Not using connection pooling — creating a new TCP connection for every request wastes time on handshakes and accumulates TIME_WAIT
- Confusing "Connection refused" (RST — the port has no listener) with "Connection timed out" (no response at all — likely a firewall dropping packets)
- Setting tcp_retries2 too low without understanding that it affects all TCP connections on the system, not just the problematic ones
A pod has 200 connections in CLOSE_WAIT state to a database. What does this indicate?