TCP — The Three-Way Handshake & Connection Lifecycle
It is 2 AM and your pager fires. A critical microservice is unreachable. You SSH into the pod's node and run
telnet api-service 8080. It hangs. No connection refused, no timeout message — just silence. Is the service down? Is a firewall blocking traffic? Is TCP even reaching the server? You cannot answer any of these questions unless you understand how TCP connections actually work — what happens at the packet level when two machines try to talk.
This lesson gives you that understanding. By the end, you will be able to look at
ss output and immediately know whether a connection is stuck, half-open, or leaking.
Part 1: The Three-Way Handshake — SYN, SYN-ACK, ACK
Every TCP connection begins with a three-way handshake. No data flows until this handshake completes. No exceptions.
Why a Handshake Exists
TCP is a connection-oriented protocol. Unlike UDP, which fires packets blindly, TCP needs both sides to agree on initial parameters before any data is exchanged. The handshake establishes:
- Sequence numbers — each side picks a random starting number so they can track packet order
- Window size — how much data each side can buffer before needing an acknowledgment
- Maximum segment size (MSS) — the largest chunk of data that fits in one TCP segment
- Options — timestamps, selective acknowledgment (SACK), window scaling
The Three Steps
Here is what happens when your application calls connect() to a remote server:
Step 1: SYN (Client to Server)
The client sends a TCP segment with the SYN flag set. This segment contains:
- Source port (ephemeral, e.g., 52431)
- Destination port (e.g., 8080)
- Initial sequence number (ISN) — a random 32-bit number (e.g., 1000)
- Window size — how much data the client can receive
- MSS option — typically 1460 bytes on Ethernet
Step 2: SYN-ACK (Server to Client)
If the server is listening on port 8080, it responds with both SYN and ACK flags set:
- Acknowledges the client's sequence number (ACK = client ISN + 1, e.g., 1001)
- Sends its own initial sequence number (e.g., 5000)
- Its own window size and MSS
Step 3: ACK (Client to Server)
The client acknowledges the server's sequence number (ACK = server ISN + 1, e.g., 5001). The connection is now ESTABLISHED. Data can flow.
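From the application side, the three steps are invisible but their completion is not: connect() does not return until the handshake finishes. A minimal Python sketch using a throwaway local listener (the loopback address and port-0 trick are just for illustration):

```python
import socket

# Minimal sketch: connect() returns only once SYN, SYN-ACK, ACK complete.
# A local listener stands in for the server; port 0 means "any free port".
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)                       # enter LISTEN; the kernel now answers SYNs
port = server.getsockname()[1]

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))    # blocks for exactly one handshake (1 RTT)

# The kernel completed the handshake from its listen queue: the connection
# is already ESTABLISHED before the application even calls accept().
conn, _ = server.accept()
peer = client.getpeername()
print(peer)                            # ('127.0.0.1', <ephemeral server port>)

for s in (client, conn, server):
    s.close()
```

Note that accept() returned instantly: the kernel had already finished the handshake on the application's behalf, which is exactly why a full listen backlog (queued, completed handshakes) causes connection problems.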
TCP Three-Way Handshake
The three-way handshake is not optional, not skippable, and not negotiable. Every single TCP connection — whether it carries one HTTP request or a million database queries — starts with SYN, SYN-ACK, ACK. This takes one full round-trip time (RTT). If your RTT to a service is 100ms, every new connection adds 100ms of latency before any data moves. This is why connection pooling and keep-alive matter enormously in distributed systems.
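The cost compounds quickly. A back-of-envelope calculation with illustrative numbers (a 100ms RTT and 50 requests, both assumptions for the example):

```python
# Handshake tax for a request fan-out, illustrative numbers only.
rtt_ms = 100       # round-trip time to the service
requests = 50      # requests made to that service

no_pooling = requests * rtt_ms    # one fresh handshake per request
with_pooling = 1 * rtt_ms         # one handshake, then reuse the connection

print(no_pooling, with_pooling)   # prints 5000 100
```

Five full seconds spent on handshakes alone versus a tenth of a second: that is the entire argument for connection pooling in one subtraction.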
Watching the Handshake with tcpdump
You can see the handshake in real time:
# Capture the handshake to port 8080
sudo tcpdump -i eth0 -nn port 8080 -c 3
# Output:
# 10:00:00.000 IP 10.0.1.5.52431 > 10.0.2.10.8080: Flags [S], seq 1000, win 65535, options [mss 1460], length 0
# 10:00:00.001 IP 10.0.2.10.8080 > 10.0.1.5.52431: Flags [S.], seq 5000, ack 1001, win 65535, options [mss 1460], length 0
# 10:00:00.001 IP 10.0.1.5.52431 > 10.0.2.10.8080: Flags [.], ack 5001, win 65535, length 0
The flags tell the story: [S] is SYN, [S.] is SYN-ACK (the dot means ACK), [.] is ACK only. If you only see the first [S] packet repeated over and over with no [S.] response, the SYN is being dropped — likely by a firewall or security group.
When debugging connection issues, tcpdump on the destination host is your best friend. If you see SYN packets arriving but no SYN-ACK leaving, the problem is on the server side (service not listening, iptables DROP rule, or the application's listen backlog is full). If you do not even see SYN packets arriving, the problem is in the network path (firewall, security group, routing).
Part 2: Sequence Numbers and Reliable Delivery
TCP's killer feature is reliable, ordered delivery. Every byte sent through a TCP connection has a sequence number, and the receiver acknowledges each byte. If a segment is lost, the sender retransmits it.
How Sequence Numbers Work
After the handshake, the client's sequence number starts at ISN + 1 (e.g., 1001). When the client sends 500 bytes of data, the segment has:
- Sequence number: 1001
- Data length: 500
The next segment will have sequence number 1501. The receiver sends back an ACK with number 1501, meaning "I have received everything up to byte 1501, send me the next one."
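The byte accounting above can be checked with a few lines of arithmetic (values taken from the running example; the modulo reflects the 32-bit sequence space):

```python
# Sequence numbers count bytes, modulo 2**32.
ISN = 1000                                # client's initial sequence number
first_data_seq = (ISN + 1) % 2**32        # the SYN itself consumes one number
after_500 = (first_data_seq + 500) % 2**32  # next segment after 500 data bytes

print(first_data_seq, after_500)          # prints 1001 1501
```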
# See sequence numbers in action
sudo tcpdump -i eth0 -nn port 8080 -S
# -S shows absolute sequence numbers (not relative)
# 10.0.1.5.52431 > 10.0.2.10.8080: seq 1001:1501, ack 5001, length 500
# 10.0.2.10.8080 > 10.0.1.5.52431: ack 1501, length 0
# 10.0.1.5.52431 > 10.0.2.10.8080: seq 1501:2001, ack 5001, length 500
Sequence numbers are 32-bit unsigned integers (0 to 4,294,967,295). On a 10 Gbps link, these numbers wrap around in about 3.4 seconds. TCP handles this with PAWS (Protection Against Wrapped Sequence numbers), which relies on the timestamp option. If you see mysterious connection resets on high-throughput links, check that TCP timestamps are enabled (net.ipv4.tcp_timestamps = 1).
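The 3.4-second figure follows directly from the sizes involved:

```python
# Why PAWS matters: time to exhaust the 32-bit sequence space at line rate.
SEQ_SPACE = 2 ** 32           # bytes before the counter wraps
link_bps = 10 * 10 ** 9       # 10 Gbps link
wrap_seconds = SEQ_SPACE / (link_bps / 8)   # bits per second -> bytes per second

print(round(wrap_seconds, 2))  # prints 3.44
```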
Retransmission: What Happens When Packets Are Lost
When a sender does not receive an ACK within the retransmission timeout (RTO), it resends the segment. The initial RTO is calculated from the observed round-trip time (RTT) using an algorithm called Karn/Partridge with Jacobson/Karels smoothing. In practice:
- First retransmit: after ~200ms (depends on RTT)
- Second retransmit: ~400ms (doubles each time — exponential backoff)
- Third retransmit: ~800ms
- After 15 retries (default on Linux): the connection is abandoned (~15 minutes total)
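A rough model of this schedule, assuming a 200ms initial RTO and the Linux 120-second RTO cap (both vary with measured RTT in practice), reproduces the timeouts quoted in this lesson:

```python
# Rough model of TCP's retransmit schedule: the RTO doubles per attempt,
# capped at 120 s on Linux. Initial RTO of 200 ms is an assumption here.
def time_to_abandon(retries2, initial_rto=0.2, rto_max=120.0):
    total, rto = 0.0, initial_rto
    # The sender waits one RTO after the initial send, then one (doubled)
    # RTO after each retransmit, hence retries2 + 1 intervals in total.
    for _ in range(retries2 + 1):
        total += rto
        rto = min(rto * 2, rto_max)
    return total

print(round(time_to_abandon(15)))  # prints 925: roughly 15 minutes (the default)
print(round(time_to_abandon(5)))   # prints 13: the ~13 seconds quoted below
```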
# Check retransmission settings
sysctl net.ipv4.tcp_retries2
# net.ipv4.tcp_retries2 = 15
# Monitor retransmissions in real time
ss -ti | grep retrans
# cubic rto:204 ... retrans:0/3
We had a Kubernetes cluster where pods talking to an external database would randomly hang for exactly 15 minutes before timing out. The database was behind a stateful firewall that silently dropped idle connections after 10 minutes. TCP would retransmit 15 times with exponential backoff, taking roughly 15 minutes before giving up. The fix was setting net.ipv4.tcp_retries2 = 5 on the nodes (about 13 seconds max) and configuring TCP keepalive on the database connection pool. Do not rely on the default 15 retries in production.
Part 3: TCP Connection States
Every TCP connection exists in one of several states. Understanding these states is critical for debugging connection problems in production.
The Full State Machine
| State | Description | Who is in this state |
|---|---|---|
| LISTEN | Waiting for incoming connections | Server |
| SYN_SENT | SYN sent, waiting for SYN-ACK | Client |
| SYN_RECEIVED | SYN-ACK sent, waiting for final ACK | Server |
| ESTABLISHED | Connection is open, data flowing | Both |
| FIN_WAIT_1 | Sent FIN, waiting for ACK | Closer (active close) |
| FIN_WAIT_2 | Got ACK of FIN, waiting for remote FIN | Closer |
| CLOSE_WAIT | Got FIN, waiting for application to close | Remote side |
| LAST_ACK | Sent FIN (after CLOSE_WAIT), waiting for ACK | Remote side |
| TIME_WAIT | Both FINs exchanged, waiting before final close | Closer |
| CLOSED | Connection is fully closed | Both |
TCP Connection States — The Full Lifecycle
The States That Cause Problems
TIME_WAIT — The Most Misunderstood State
When the side that initiates the close (sends the first FIN) completes the four-way teardown, it enters TIME_WAIT for 2x the Maximum Segment Lifetime (MSL). On Linux, this is 60 seconds (2 x 30s).
Why does TIME_WAIT exist? Two reasons:
- Ensure the final ACK was received. If the remote side's FIN was lost and retransmitted, the closing side needs to be around to re-ACK it.
- Prevent old packets from corrupting new connections. If a new connection immediately reuses the same source IP, source port, destination IP, destination port tuple, delayed packets from the old connection could be misinterpreted.
# Count connections in TIME_WAIT
ss -s
# TCP: 1542 (estab 340, closed 200, orphaned 0, timewait 982)
# ^^^^^^^^^
# 982 connections waiting to die. Each holds a port.
# List TIME_WAIT connections
ss -tan state time-wait
TIME_WAIT is not a bug. It is a safety mechanism. But in high-throughput environments (like a K8s node proxying thousands of requests per second), TIME_WAIT accumulation can exhaust ephemeral ports. The fix is NOT to reduce TIME_WAIT duration — it is to use connection pooling and keep-alive so you make fewer connections in the first place.
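A quick calculation shows why exhaustion happens, assuming the default Linux ephemeral range (net.ipv4.ip_local_port_range = 32768 60999):

```python
# TIME_WAIT as a rate limit on fresh connections to a single destination.
ephemeral_ports = 60999 - 32768 + 1   # 28232 usable source ports (Linux default)
time_wait_seconds = 60                # how long each closed connection holds one

max_new_conns_per_sec = ephemeral_ports / time_wait_seconds
print(round(max_new_conns_per_sec))   # prints 471
```

Open connections faster than roughly 470 per second to one destination IP:port without pooling, and connect() starts failing with address-in-use errors long before the network itself is a bottleneck.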
CLOSE_WAIT — The Leak Detector
CLOSE_WAIT means: "The remote side sent FIN (it is done), but your application has not called close() on the socket yet." If you see connections stuck in CLOSE_WAIT, it means your application has a connection leak — it is not closing sockets after the remote side disconnects.
# Find CLOSE_WAIT connections — these are YOUR app leaking
ss -tan state close-wait
# State Recv-Q Send-Q Local Address:Port Peer Address:Port
# CLOSE-WAIT 0 0 10.0.1.5:52431 10.0.2.10:5432
# CLOSE-WAIT 0 0 10.0.1.5:52432 10.0.2.10:5432
# CLOSE-WAIT 0 0 10.0.1.5:52433 10.0.2.10:5432
# ^^^ Three leaked database connections
A growing count of CLOSE_WAIT connections is almost always an application bug. The remote side has closed its end, but your application is holding the socket open — probably because it lost track of the connection object. This eventually leads to file descriptor exhaustion and your application refusing new connections with "Too many open files." Fix the application code, not the kernel settings.
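From the application's point of view, the remote FIN shows up as recv() returning an empty read. A minimal Python sketch with a throwaway local socket pair (addresses illustrative):

```python
import socket

# Sketch: a peer's FIN surfaces as recv() returning b"". Not calling
# close() after that empty read is what leaves sockets in CLOSE_WAIT.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(server.getsockname())
conn, _ = server.accept()

conn.close()                  # remote side sends FIN
data = client.recv(1024)      # the EOF that signals the FIN arrived
print(data == b"")            # True: the peer is done with this connection

client.close()                # the fix: always close() after reading EOF
server.close()
```

Between the empty recv() and the close(), the client socket sits in CLOSE_WAIT; an application that drops the socket object without closing it stays there until the process exits.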
Part 4: TCP Keepalive — Detecting Dead Connections
TCP keepalive is a mechanism for detecting dead connections when no data is flowing. After a period of silence, the OS sends a tiny probe packet. If no response comes after several retries, the connection is considered dead.
The Default Settings (and Why They Are Wrong for K8s)
# Linux TCP keepalive defaults
sysctl net.ipv4.tcp_keepalive_time
# net.ipv4.tcp_keepalive_time = 7200 (wait 2 HOURS before first probe)
sysctl net.ipv4.tcp_keepalive_intvl
# net.ipv4.tcp_keepalive_intvl = 75 (75 seconds between probes)
sysctl net.ipv4.tcp_keepalive_probes
# net.ipv4.tcp_keepalive_probes = 9 (give up after 9 failed probes)
With defaults: if a remote pod dies silently (no FIN sent — think kill -9 or node failure), it takes 2 hours + 9 x 75 seconds = 2 hours 11 minutes before your application even knows the connection is dead. In a Kubernetes environment where pods restart in seconds, this is absurd.
# Reasonable keepalive settings for Kubernetes
sysctl -w net.ipv4.tcp_keepalive_time=30 # First probe after 30 seconds idle
sysctl -w net.ipv4.tcp_keepalive_intvl=10 # Probe every 10 seconds
sysctl -w net.ipv4.tcp_keepalive_probes=3 # Give up after 3 failed probes
# Total detection time: 30 + (3 × 10) = 60 seconds
Most application-level connection pools (database pools, HTTP clients, gRPC channels) have their own keepalive mechanisms that are faster and more reliable than TCP keepalive. Always configure application-level keepalive first. Use TCP keepalive as a safety net, not as the primary dead-connection detector. For gRPC in K8s, set the client channel's keepalive time to 10 seconds via the grpc.keepalive_time_ms channel argument.
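When you cannot change node-wide sysctls, keepalive can also be set per socket before connecting. A Python sketch using the Linux-specific socket options (the values mirror the sysctls above; option names differ on other platforms):

```python
import socket

# Per-socket keepalive overrides (Linux socket options), scoped to this
# one connection instead of the whole node.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 30)   # idle secs before first probe
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # secs between probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)     # failed probes before reset

idle = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE)
print(idle)  # prints 30: dead peers now detected in ~30 + 3*10 = 60 seconds
sock.close()
```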
Part 5: Connection Reset (RST) — The Emergency Stop
A TCP RST (reset) immediately tears down a connection without the graceful four-way close. No FIN, no waiting — just "this connection is dead, right now."
When RSTs Happen
- Connecting to a port with no listener. If nothing is listening on port 8080, the kernel responds to SYN with RST. This is the "Connection refused" error.
- Application crashes. If a process crashes without closing its sockets, the kernel sends RST to all connected peers.
- Firewall intervention. Some firewalls send RST to actively reject connections (instead of silently dropping).
- Half-open connection detected. If one side thinks the connection is alive but the other has forgotten about it (after a reboot, for example), the first data packet triggers an RST.
- TCP resource limits. If the listen backlog is full, some configurations respond with RST (on Linux, when net.ipv4.tcp_abort_on_overflow is set; the default is to drop silently).
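The first case is easy to reproduce: a SYN to a port with no listener comes back as RST, which the OS surfaces as "connection refused". A Python sketch; the bind-and-release below is just a way to find a local port that has no listener:

```python
import socket

# Find a port that nothing is listening on: bind to port 0 (kernel picks a
# free port), then release it without ever calling listen().
probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
probe.bind(("127.0.0.1", 0))
unused_port = probe.getsockname()[1]
probe.close()

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
refused = False
try:
    sock.connect(("127.0.0.1", unused_port))
except ConnectionRefusedError:
    refused = True        # the kernel answered our SYN with RST
finally:
    sock.close()

print(refused)            # prints True
```

Contrast this with a firewall silently dropping the SYN: there the same connect() would hang until it times out, with no exception for seconds or minutes.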
# Filter for RST packets in tcpdump
sudo tcpdump -i eth0 -nn 'tcp[tcpflags] & (tcp-rst) != 0' -c 5
# In ss output, look for connections that vanish suddenly
# (they won't show up — RST closes them immediately)
A team migrated their API gateway to a new Kubernetes cluster and suddenly 5% of requests failed with "connection reset by peer." The root cause: the new cluster had a Network Policy that allowed traffic on port 443 but the backend pods were listening on port 8443. The gateway's health checks passed (they checked a different port), but actual traffic hit port 8443, got RST from the kernel (no listener), and the gateway reported it as "connection reset." The fix was a two-line change to the Network Policy. Always check that your allowed ports match your actual listening ports.
Part 6: Window Size and Flow Control
TCP flow control prevents a fast sender from overwhelming a slow receiver. Each side advertises a receive window — the amount of data it can buffer.
How It Works
- The receiver advertises its window size in every ACK: "I can accept 65,535 more bytes."
- The sender never sends more than the advertised window without getting an ACK.
- If the receiver is slow (maybe its application is busy), it advertises a smaller window.
- If the window hits zero, the sender pauses. This is called zero-window — a sign that the receiver is overwhelmed.
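You can provoke the sender-side stall locally. The sketch below shrinks the socket buffers (sizes are illustrative), never reads on the receiving side, and uses a non-blocking sender so the stall surfaces as an exception instead of a silent hang:

```python
import socket

# Flow control closing the pipe: small buffers, a receiver that never
# reads, and a non-blocking sender that hits the wall once the receive
# window plus local buffering is full.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)

sender = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sender.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4096)
sender.connect(server.getsockname())
receiver, _ = server.accept()
receiver.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4096)

sender.setblocking(False)
sent = 0
blocked = False
try:
    while sent < 50_000_000:           # far more than the buffers can hold
        sent += sender.send(b"x" * 4096)
except BlockingIOError:
    blocked = True                     # window and buffers full: sender must wait

print(blocked, sent)                   # True, and sent is well short of 50 MB

for s in (sender, receiver, server):
    s.close()
```

With a blocking socket, that same condition shows up as a write() or send() that simply never returns, which is how a zero-window receiver stalls an entire upstream service.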
# Detect zero-window situations
ss -ti dst 10.0.2.10 | grep -i "rcv_space\|snd_wnd"
# snd_wnd:0 indicates the remote side has a zero window
# In tcpdump
sudo tcpdump -i eth0 -nn port 8080 | grep "win 0"
# 10.0.2.10.8080 > 10.0.1.5.52431: Flags [.], ack 100001, win 0, length 0
If you see zero-window conditions in your connections, the bottleneck is the receiver, not the network. The receiver's application is not reading data from the socket fast enough. Common causes in K8s: CPU throttling on the receiving pod (increase CPU limits), garbage collection pauses in Java applications, or a single-threaded event loop that is blocked.
Part 7: Inspecting Connections with ss and netstat
The ss command (socket statistics) is the modern replacement for netstat. It is faster and provides more detail.
# Show all TCP connections with state
ss -tan
# State Recv-Q Send-Q Local Address:Port Peer Address:Port
# LISTEN 0 128 0.0.0.0:8080 0.0.0.0:*
# ESTAB 0 0 10.0.1.5:52431 10.0.2.10:5432
# TIME-WAIT 0 0 10.0.1.5:52432 10.0.2.10:5432
# Count connections by state
ss -tan | tail -n +2 | awk '{print $1}' | sort | uniq -c | sort -rn
# 340 ESTAB
# 982 TIME-WAIT
# 12 CLOSE-WAIT
# 3 FIN-WAIT-2
# 1 LISTEN
# Show connections to a specific destination
ss -tan dst 10.0.2.10
# Show listening sockets with the process name
ss -tlnp
# LISTEN 0 128 0.0.0.0:8080 0.0.0.0:* users:(("nginx",pid=1234,fd=6))
# Show detailed TCP info (RTT, retransmissions, window)
ss -ti dst 10.0.2.10
# ESTAB 0 0 10.0.1.5:52431 10.0.2.10:5432
# cubic rto:204 rtt:1.5/0.5 mss:1448 rcv_space:14480 send 77.2Mbps
In a Kubernetes debugging session, run ss -tan | awk '{print $1}' | sort | uniq -c | sort -rn inside the pod or on the node. If you see hundreds of TIME_WAIT connections to the same destination, your application is creating new TCP connections for every request instead of using connection pooling. If you see growing CLOSE_WAIT, your application is leaking connections. This single command tells you more about connection health than most dashboards.
Key Concepts Summary
- The three-way handshake (SYN, SYN-ACK, ACK) is required for every TCP connection and costs one full round-trip time of latency
- Sequence numbers enable reliable, ordered delivery — every byte is tracked and acknowledged
- Retransmission handles packet loss with exponential backoff — default 15 retries over ~15 minutes on Linux
- TCP connection states tell you exactly what is happening: ESTABLISHED is healthy, TIME_WAIT is normal cleanup, CLOSE_WAIT is a leak
- TIME_WAIT exists to prevent packet confusion between old and new connections on the same port tuple — do not disable it, use connection pooling instead
- CLOSE_WAIT means your application is not closing sockets — this is always an application bug
- TCP keepalive defaults (2 hours) are far too slow for Kubernetes — configure 30-60 second detection
- RST (reset) is an immediate teardown — "Connection refused" means RST in response to SYN
- Zero-window means the receiver is overwhelmed — the bottleneck is the application, not the network
- ss is your primary tool for inspecting connection states on Linux
Common Mistakes
- Trying to eliminate TIME_WAIT by setting net.ipv4.tcp_tw_reuse without understanding the consequences — this can cause data corruption on connections to the same destination
- Ignoring CLOSE_WAIT connections — they indicate application-level connection leaks that will eventually exhaust file descriptors
- Relying on TCP keepalive defaults (2 hours) in Kubernetes — pods restart in seconds, but dead connections linger for hours
- Blaming the network when telnet hangs — a hanging telnet means SYN packets are being silently dropped, which is a firewall or security group issue, not a network outage
- Not using connection pooling — creating a new TCP connection for every request wastes time on handshakes and accumulates TIME_WAIT
- Confusing "Connection refused" (RST — the port has no listener) with "Connection timed out" (no response at all — likely a firewall dropping packets)
- Setting tcp_retries2 too low without understanding that it affects all TCP connections on the system, not just the problematic ones
A pod has 200 connections in CLOSE_WAIT state to a database. What does this indicate?