Networking Fundamentals for DevOps Engineers

TCP vs UDP — When to Use Which

Your gRPC service works perfectly in local development. Requests are fast, streaming works, everything is smooth. Then you deploy to Kubernetes and route traffic through a ClusterIP Service. Suddenly, connections drop, latency spikes, and a well-meaning teammate suggests: "gRPC is slow over TCP in K8s. Let's try UDP."

That suggestion is fundamentally wrong, and understanding why requires knowing what TCP and UDP actually are, what guarantees they provide, and which protocols depend on those guarantees.

This lesson gives you the mental model to immediately know whether a protocol runs on TCP or UDP — and why.


Part 1: TCP — Reliable, Ordered, Connection-Oriented

TCP (Transmission Control Protocol) was defined in RFC 793 (now superseded by RFC 9293). It provides three guarantees that UDP does not:

  1. Reliability — every byte sent will arrive at the destination, or the sender will be notified of failure
  2. Ordering — bytes arrive in the same order they were sent
  3. Flow control — the sender will not overwhelm the receiver

These guarantees come at a cost:

  • Connection setup — the three-way handshake adds one RTT of latency before any data flows
  • Head-of-line blocking — if one packet is lost, all subsequent packets must wait for the retransmission
  • Overhead — TCP headers are 20-60 bytes (vs 8 bytes for UDP), plus ACK traffic in both directions
  • State — both sides must maintain connection state (sequence numbers, windows, timers)

# TCP header (20 bytes minimum)
# +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
# |          Source Port          |       Destination Port        |
# +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
# |                        Sequence Number                        |
# +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
# |                     Acknowledgment Number                     |
# +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
# | Offset|  Res  |C|E|U|A|P|R|S|F|          Window Size          |
# +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
# |           Checksum            |         Urgent Pointer        |
# +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
# |                        Options (if any)                       |
# +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
# Flag bits: CWR, ECE, URG, ACK, PSH, RST, SYN, FIN

KEY CONCEPT

TCP's reliability guarantee means the protocol handles retransmission, ordering, and error detection for you. The application never has to worry about missing or out-of-order data. This is why virtually every request/response protocol (HTTP, gRPC, database protocols, SSH) is built on TCP — because the application developers do not want to implement their own reliability layer.
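
This guarantee is easy to demonstrate with two sockets on localhost. The sketch below is illustrative only — the echo server, port selection, and payload are invented for the demo — but it shows the contract TCP gives the application: 5,000 bytes go in, and exactly those bytes come out, in order, with no application-level retry logic:

```python
# Illustrative localhost demo: TCP delivers every byte, in order.
# The server, port choice, and payload are made up for this sketch.
import socket
import threading

def run_echo_server(server_sock: socket.socket) -> None:
    """Accept one connection and echo everything back."""
    conn, _ = server_sock.accept()
    with conn:
        while True:
            data = conn.recv(4096)
            if not data:          # peer closed its write side
                break
            conn.sendall(data)    # sendall blocks until every byte is queued

def tcp_round_trip() -> bytes:
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))   # port 0: let the kernel pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    threading.Thread(target=run_echo_server, args=(server,), daemon=True).start()

    payload = b"".join(bytes([i % 256]) * 100 for i in range(50))  # 5,000 bytes
    with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
        client.sendall(payload)
        client.shutdown(socket.SHUT_WR)    # tell the server we are done sending
        received = b""
        while chunk := client.recv(4096):  # TCP may split the stream arbitrarily
            received += chunk
    assert received == payload             # reliable AND ordered
    return received

if __name__ == "__main__":
    tcp_round_trip()
```

Note what the application does not check: chunk sizes. TCP preserves byte order, not message boundaries — the stream may arrive in any number of pieces.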

What TCP Gives You (and What It Costs)

  Feature             | Benefit                        | Cost
  --------------------|--------------------------------|------------------------------------------
  Three-way handshake | Both sides confirmed reachable | 1 RTT before data flows
  Sequence numbers    | Data arrives in order          | 8 bytes of header, state tracking
  Acknowledgments     | Sender knows what arrived      | Return traffic, latency
  Retransmission      | Lost packets recovered         | Delay on loss (head-of-line blocking)
  Flow control        | Receiver not overwhelmed       | May throttle sender
  Congestion control  | Network not overwhelmed        | Slow start, reduced throughput initially

Part 2: UDP — Unreliable, Unordered, Connectionless

UDP (User Datagram Protocol) is defined in RFC 768. It provides almost nothing:

  • No connection setup — just send packets
  • No reliability — if a packet is lost, it is gone
  • No ordering — packets may arrive in any order
  • No flow control — the sender can blast as fast as it wants
  • No congestion control — UDP will happily saturate the network

The entire UDP header is 8 bytes:

# UDP header (8 bytes — that's it)
# +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
# |          Source Port          |       Destination Port        |
# +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
# |            Length             |           Checksum            |
# +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
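
A minimal sketch of what connectionless means in practice — no listen(), no accept(), no handshake, just a datagram. The addresses and names here are invented for the demo; it only looks reliable because loopback rarely drops packets, and on a real network the same sendto() carries no delivery guarantee:

```python
# Illustrative localhost demo: UDP is fire-and-forget datagrams.
# Looks reliable here only because loopback rarely drops packets.
import socket

def udp_round_trip(message: bytes) -> bytes:
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))        # no listen()/accept() — no connection
    receiver.settimeout(5.0)
    addr = receiver.getsockname()

    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender.sendto(message, addr)           # one datagram out: no setup, no ACK
    data, _src = receiver.recvfrom(65535)  # datagrams keep message boundaries
    sender.close()
    receiver.close()
    return data

if __name__ == "__main__":
    print(udp_round_trip(b"ping"))
```

There is no connection to set up or tear down — but also nothing that tells the sender whether the datagram arrived.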

TCP vs UDP — Protocol Comparison

TCP — reliable, ordered, connection-oriented

  Header size:         20-60 bytes
  Connection setup:    Three-way handshake (1 RTT)
  Reliability:         Guaranteed delivery with retransmission
  Ordering:            Bytes arrive in the order sent
  Flow control:        Receiver window prevents overload
  Congestion control:  Slow start, AIMD, CUBIC
  State:               Both sides track the connection
  Use when:            Correctness matters more than speed

UDP — unreliable, unordered, connectionless

  Header size:         8 bytes
  Connection setup:    None — just send
  Reliability:         None — packets may be lost
  Ordering:            None — packets may arrive out of order
  Flow control:        None — the sender blasts freely
  Congestion control:  None — the application must self-regulate
  State:               Stateless — no connection to track
  Use when:            Speed matters more than perfection

Why UDP Exists

If UDP provides no guarantees, why use it? Because sometimes the guarantees are worse than the problem:

1. The data is time-sensitive. In a video call, a packet that arrives 500ms late (after TCP retransmission) is useless — the moment has passed. It is better to skip it and show the next frame. UDP lets the application decide what to do with missing data instead of forcing a wait.

2. The interaction is a single request/response. DNS queries are typically one packet out, one packet back. A three-way handshake to set up a connection would triple the latency for a 50-byte question. UDP skips the handshake entirely.

3. The application implements its own reliability. QUIC (the protocol under HTTP/3) runs on UDP but implements its own reliability, ordering, and congestion control in userspace. This lets it fix TCP's head-of-line blocking problem while still being reliable.
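
Point 2 above is easy to appreciate by counting bytes. Below is an illustrative sketch (field layout per RFC 1035; the helper name is our own) that hand-builds the wire form of a DNS A-record query — the whole question fits in a 28-byte datagram, so a connection handshake would cost more than the query itself:

```python
# Hand-build a DNS query to show how small it is: one UDP datagram.
# Field layout per RFC 1035; the function name is invented for this sketch.
import struct

def build_dns_query(hostname: str, txn_id: int = 0x1234) -> bytes:
    flags = 0x0100                      # standard query, recursion desired (RD)
    # Header: ID, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", txn_id, flags, 1, 0, 0, 0)
    # QNAME: length-prefixed labels, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question

query = build_dns_query("google.com")
# 12-byte header + 12-byte qname ("\x06google\x03com\x00") + 4 bytes = 28 bytes
print(len(query))
```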

PRO TIP

A useful mental model: TCP says "the network will handle reliability for you." UDP says "you handle reliability yourself (or don't)." Modern protocols like QUIC choose UDP not because they want unreliable delivery, but because they want to implement reliability differently than TCP does. Raw UDP with no application-level reliability is only appropriate for truly fire-and-forget use cases like metrics collection or logging.


Part 3: Where Each Protocol Is Used

Protocols That Use TCP

  Protocol         | Port              | Why TCP
  -----------------|-------------------|---------------------------------------------------
  HTTP/1.1, HTTP/2 | 80, 443           | Every byte of a web page must arrive in order
  gRPC             | 443 (over HTTP/2) | Streaming RPCs need reliable, ordered delivery
  SSH              | 22                | You cannot afford to lose keystrokes
  PostgreSQL       | 5432              | Queries and results must be complete and ordered
  MySQL            | 3306              | Same as PostgreSQL
  Redis            | 6379              | Commands and responses must arrive reliably
  SMTP             | 25, 587           | Email must not lose content
  FTP              | 21                | File transfers need completeness

Protocols That Use UDP

  Protocol              | Port     | Why UDP
  ----------------------|----------|------------------------------------------------
  DNS (queries)         | 53       | Single-packet request/response, speed matters
  DHCP                  | 67, 68   | Broadcast-based, no connection possible
  NTP                   | 123      | Time-sync packets — late data is useless
  SNMP                  | 161, 162 | Monitoring polls — missing one is fine
  syslog                | 514      | Log shipping — some loss acceptable
  QUIC / HTTP/3         | 443      | Implements its own reliability on top of UDP
  Video streaming (RTP) | Various  | Late frames are useless — skip and move on
  Gaming protocols      | Various  | Stale position updates are worse than no update

KEY CONCEPT

DNS is the most important UDP protocol you will encounter in Kubernetes. Every service discovery lookup, every external API call, every domain resolution starts with a UDP packet to port 53. When DNS is slow or broken in your cluster, everything is slow or broken. CoreDNS performance directly impacts every microservice.

DNS: The Protocol That Uses Both

DNS primarily uses UDP, but switches to TCP in specific cases:

  1. Response exceeds 512 bytes (or 4096 with EDNS0) — the server responds with the TC (truncated) flag, and the client retries over TCP
  2. Zone transfers (AXFR/IXFR) — when a secondary DNS server copies the full zone from a primary, it always uses TCP because the data is large
  3. DNS over TLS (DoT) — port 853, always TCP
  4. DNS over HTTPS (DoH) — port 443, always TCP (it is HTTP)

# DNS over UDP (default)
dig google.com
# ;; MSG SIZE  rcvd: 55    (fits in one UDP packet)

# Force DNS over TCP
dig +tcp google.com
# ;; MSG SIZE  rcvd: 55    (same answer, but used TCP — slower)

# See if responses are being truncated
dig large-record.example.com | grep flags
# ;; flags: qr rd ra tc;    (tc = truncated, will retry over TCP)
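
The client-side half of case 1 reduces to a few lines: inspect the flags word of the DNS response header and retry over TCP when the TC bit is set. A sketch with hand-crafted headers (not captured traffic):

```python
# Sketch of the UDP-to-TCP fallback rule: check the TC (truncated) bit
# in a DNS response header. Sample headers below are hand-crafted.
import struct

TC_BIT = 0x0200  # truncated-response flag in the DNS header flags word

def should_retry_over_tcp(response: bytes) -> bool:
    if len(response) < 12:              # the DNS header is always 12 bytes
        raise ValueError("short DNS response")
    _txn_id, flags = struct.unpack("!HH", response[:4])
    return bool(flags & TC_BIT)

# QR=1 (response), RD, RA set — one header with TC, one without
normal_hdr    = struct.pack("!HHHHHH", 1, 0x8180, 1, 1, 0, 0)
truncated_hdr = struct.pack("!HHHHHH", 1, 0x8180 | TC_BIT, 1, 0, 0, 0)

print(should_retry_over_tcp(normal_hdr))     # UDP answer is complete
print(should_retry_over_tcp(truncated_hdr))  # fall back to TCP
```
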
WARNING

If your Kubernetes NetworkPolicies allow DNS on UDP port 53 but block TCP port 53, you will have intermittent DNS failures. Any DNS response that exceeds the UDP size limit will fail because the TCP fallback is blocked. Always allow both UDP and TCP on port 53 in your DNS policies. This is one of the most common networking misconfigurations in K8s.


Part 4: QUIC and HTTP/3 — Reliable Protocol on UDP

QUIC deserves special attention because it confuses people. "HTTP/3 uses UDP" sounds like it is unreliable. It is not.

QUIC is a reliable, multiplexed transport protocol built on top of UDP datagrams. It reimplements everything TCP provides — reliability, ordering, congestion control — but does it in userspace rather than in the kernel.

Why Not Just Use TCP?

TCP has a fundamental problem: head-of-line blocking at the transport layer. If you multiplex 10 HTTP/2 streams over a single TCP connection, and one packet from stream 3 is lost, TCP holds up ALL 10 streams while it retransmits that packet. The other 9 streams are fine — their data has arrived — but TCP does not know about streams. It only knows about one ordered byte stream.

QUIC solves this because it knows about streams. If a packet from stream 3 is lost, only stream 3 is blocked. The other 9 streams continue unaffected.
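
A toy delivery model makes the difference concrete (deliberately simplified — real TCP and QUIC loss recovery involve retransmission and reordering buffers): 10 streams, 2 packets each, packet (3, 0) lost. With one shared byte stream, nothing after the loss is readable; with per-stream ordering, only stream 3 stalls:

```python
# Toy simulation (not real TCP/QUIC): which packets can the application
# read immediately after one packet is lost, under two ordering models?
def deliverable(packets, lost, per_stream_ordering):
    """packets: list of (stream_id, seq); returns packets readable now."""
    delivered = []
    blocked_streams = set()
    blocked_all = False
    for pkt in packets:
        stream_id, _seq = pkt
        if pkt == lost:
            if per_stream_ordering:
                blocked_streams.add(stream_id)  # QUIC-like: this stream waits
            else:
                blocked_all = True              # TCP-like: everything waits
            continue
        if blocked_all or stream_id in blocked_streams:
            continue
        delivered.append(pkt)
    return delivered

packets = [(s, 0) for s in range(10)] + [(s, 1) for s in range(10)]
lost = (3, 0)
print(len(deliverable(packets, lost, per_stream_ordering=False)))  # 3
print(len(deliverable(packets, lost, per_stream_ordering=True)))   # 18
```

With the shared-stream model only 3 of 20 packets are readable (those sent before the loss); with per-stream ordering, 18 are — everything except the lost packet and its successor on stream 3.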

HTTP/2 over TCP vs HTTP/3 over QUIC

HTTP/2 over TCP — all streams share one TCP connection

  Transport:              TCP (kernel)
  Streams:                Multiplexed, but TCP sees one byte stream
  Packet loss:            One lost packet blocks ALL streams
  Connection setup:       TCP handshake + TLS handshake = 2-3 RTT
  Connection migration:   Not possible — tied to the source/destination IP:port 4-tuple
  Head-of-line blocking:  At the TCP level — all streams blocked

HTTP/3 over QUIC — each stream is independent

  Transport:              QUIC over UDP (userspace)
  Streams:                Multiplexed with independent loss recovery
  Packet loss:            Only the affected stream is blocked
  Connection setup:       Combined transport + TLS = 1 RTT (0-RTT on resumption)
  Connection migration:   Survives IP changes (uses a connection ID)
  Head-of-line blocking:  Eliminated — streams are independent

PRO TIP

In Kubernetes, QUIC/HTTP/3 support depends on your ingress controller. NGINX Ingress has experimental HTTP/3 support. Envoy (used by Istio and Gateway API) has production QUIC support. If your clients are browsers, enabling HTTP/3 on your ingress can noticeably improve latency for users on lossy networks (mobile, WiFi). For internal service-to-service communication, HTTP/2 over TCP is still the norm.


Part 5: TCP and UDP in Kubernetes

Service Protocol Configuration

Kubernetes Services default to TCP. You can explicitly set the protocol:

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app
  ports:
    - name: http
      port: 80
      targetPort: 8080
      protocol: TCP      # default if omitted

    - name: dns
      port: 53
      targetPort: 5353
      protocol: UDP      # must be explicit for UDP
WARNING

On many cloud providers you cannot mix TCP and UDP ports in a single LoadBalancer Service — Kubernetes itself only permits it where the MixedProtocolLBService feature is available, and the cloud load balancer must support it too. If you need both protocols on one address (like CoreDNS, which serves DNS on both TCP and UDP port 53), you may need two separate Services or a cloud-specific annotation. This limitation catches people when they try to expose game servers or media services that use both protocols.

How kube-proxy Handles TCP vs UDP

kube-proxy (or its IPVS/eBPF replacement) handles TCP and UDP differently:

TCP with iptables mode:

  • DNAT rule rewrites destination IP from ClusterIP to a pod IP
  • Connection tracking (conntrack) ensures all packets in the same TCP connection go to the same pod
  • If the backend pod dies, existing connections hang until TCP timeout

UDP with iptables mode:

  • Same DNAT rule, but no connection concept
  • conntrack tracks UDP "connections" based on the 5-tuple with a default timeout of 30 seconds
  • If a DNS response is slow, a second query might get load-balanced to a different pod — and the response from the first pod gets dropped by conntrack
# Check conntrack entries for TCP and UDP
conntrack -L -p tcp | head -5
conntrack -L -p udp | head -5

# Count conntrack entries by protocol
conntrack -C
# Total: 15432

# conntrack table full? Packets get dropped silently
dmesg | grep conntrack
# nf_conntrack: table full, dropping packet
WAR STORY

We had a production cluster where DNS resolution would randomly fail about 1% of the time. The cause was a known Linux kernel bug (since fixed): a race condition in conntrack meant that SNAT-ed UDP packets from pods could get inconsistent source-port rewrites and confused conntrack entries. The symptom was DNS responses being delivered to the wrong pod. The fix was upgrading the kernel; the workaround was setting the net.netfilter.nf_conntrack_udp_timeout and net.netfilter.nf_conntrack_udp_timeout_stream sysctls to more aggressive values. Always check your kernel version if you see intermittent DNS failures in K8s.


Part 6: Why gRPC Needs TCP (Answering the Opening Scenario)

Back to the opening scenario: your gRPC service works locally but fails through a Kubernetes Service. The suggestion to "try UDP" is wrong for a fundamental reason.

gRPC is built on HTTP/2. HTTP/2 is built on TCP. gRPC requires:

  1. Reliable delivery — RPC request and response bytes must all arrive
  2. Ordering — protobuf messages must be reassembled in order
  3. Bidirectional streaming — both sides send data simultaneously over one connection
  4. Multiplexing — many RPCs share one TCP connection

UDP provides none of these. Running gRPC over raw UDP is like driving a car without wheels — the layer you removed is the one everything else rests on.

The actual problem with gRPC in Kubernetes is usually load balancing. gRPC uses long-lived HTTP/2 connections. A Kubernetes ClusterIP Service load-balances at the connection level, not the request level. If a client opens one gRPC connection, all requests go to one pod — no load distribution.

# The fix: use a gRPC-aware load balancer
# Option 1: Client-side load balancing (resolve all pod IPs)
# Option 2: Envoy/Istio sidecar proxy (L7 load balancing)
# Option 3: Use a headless Service + gRPC client-side balancing

# Headless Service (no ClusterIP, returns all pod IPs)
apiVersion: v1
kind: Service
metadata:
  name: grpc-service
spec:
  clusterIP: None     # headless
  selector:
    app: my-grpc-app
  ports:
    - port: 50051
      protocol: TCP
KEY CONCEPT

The gRPC-in-Kubernetes problem is never about TCP vs UDP. It is about L4 vs L7 load balancing. TCP load balancers (kube-proxy, ClusterIP) distribute connections. gRPC multiplexes many requests over one connection. You need an L7 load balancer (Envoy, Linkerd, Istio) that understands HTTP/2 framing and distributes individual requests across backend pods.
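
The client-side approach (options 1 and 3 in the fix list above) boils down to resolving every pod IP and rotating requests across them. Here is a sketch with a stubbed resolver — a real client would look up the headless Service name (e.g. grpc-service.default.svc.cluster.local) over DNS and re-resolve as pods come and go; the IPs below are invented:

```python
# Sketch of client-side, per-request balancing across pod IPs.
# The resolver is a stub; real code would query DNS for the headless
# Service, which returns one A record per ready pod.
import itertools

def resolve_pod_ips(service_name: str) -> list[str]:
    # Stub standing in for a DNS lookup against a headless Service.
    return ["10.244.1.5", "10.244.2.7", "10.244.3.9"]

class RoundRobinPicker:
    """Rotate through backends one REQUEST at a time — per-request (L7-style)
    distribution, unlike kube-proxy's per-connection (L4) balancing."""
    def __init__(self, service_name: str):
        self._cycle = itertools.cycle(resolve_pod_ips(service_name))

    def next_backend(self) -> str:
        return next(self._cycle)

picker = RoundRobinPicker("grpc-service.default.svc.cluster.local")
print([picker.next_backend() for _ in range(4)])
# ['10.244.1.5', '10.244.2.7', '10.244.3.9', '10.244.1.5']
```

A production client would also re-resolve periodically and drain connections to removed pods — that bookkeeping is exactly what Envoy-style sidecars do for you.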


Key Concepts Summary

  • TCP provides reliability, ordering, and flow control at the cost of connection setup latency and head-of-line blocking
  • UDP provides nothing — it is a thin wrapper around IP that adds port numbers, a length field, and a checksum
  • Most protocols use TCP because application developers do not want to implement their own reliability
  • UDP is used when speed matters more than completeness — DNS queries, video streaming, gaming, time-sync
  • DNS uses both protocols — UDP for queries, TCP for large responses and zone transfers. Always allow both in firewall rules.
  • QUIC/HTTP/3 runs on UDP but IS reliable — it implements TCP-like guarantees in userspace to fix head-of-line blocking
  • gRPC requires TCP (via HTTP/2) — the K8s problem is L7 load balancing, not protocol choice
  • In Kubernetes, Services default to TCP — UDP must be explicitly specified and has different conntrack behavior

Common Mistakes

  • Suggesting UDP for protocols that require reliability (gRPC, HTTP, database connections) — this reveals a fundamental misunderstanding of the transport layer
  • Blocking TCP port 53 in NetworkPolicies while allowing UDP port 53 — large DNS responses will fail
  • Assuming QUIC/HTTP/3 is "unreliable" because it runs on UDP — QUIC implements its own reliability
  • Not specifying protocol: UDP in K8s Service manifests — everything defaults to TCP
  • Mixing TCP and UDP ports in a single LoadBalancer Service without checking cloud provider support
  • Ignoring conntrack table limits — when the table fills up, both TCP and UDP packets are silently dropped
  • Diagnosing gRPC performance issues as a TCP/UDP problem when it is actually an L4/L7 load balancing problem

KNOWLEDGE CHECK

Why does DNS primarily use UDP instead of TCP for standard queries?