OSI Meets Kubernetes Networking
Pod A cannot reach Pod B. Your developer files a ticket: "Networking is broken." You SSH into the node and start debugging. Is it DNS? Is it the CNI plugin? Is it a NetworkPolicy? Is it the application returning errors?
Without a framework, you are guessing. With the OSI model mapped to Kubernetes, you know exactly which layer to check. CNI problems are L2/L3. Service routing issues are L4. Ingress problems are L7. Each layer has different tools, different symptoms, and different fixes.
The Three Networks in Kubernetes
Before we map OSI layers, you need to understand that Kubernetes has three separate networks operating simultaneously. This is the single most important concept for understanding K8s networking.
1. Node Network — The physical (or cloud VPC) network connecting your Kubernetes nodes. These are the IPs your cloud provider assigns. Nodes communicate with each other over this network. This is a real, routable network.
2. Pod Network — A virtual network where every pod gets its own unique IP address. This is managed by the CNI plugin (Calico, Cilium, Flannel). Pods on the same node communicate directly. Pods on different nodes communicate through overlay or routing.
3. Service Network — A completely virtual network that exists only in iptables/IPVS rules. ClusterIP addresses are not assigned to any interface. They are translated to pod IPs by kube-proxy.
# See the three networks in action:
# Node network
kubectl get nodes -o wide
# NAME STATUS INTERNAL-IP EXTERNAL-IP
# node-1 Ready 10.0.0.10 203.0.113.10
# node-2 Ready 10.0.0.11 203.0.113.11
# Pod network
kubectl get pods -o wide
# NAME READY IP NODE
# pod-a 1/1 10.244.0.5 node-1
# pod-b 1/1 10.244.1.8 node-2
# Service network
kubectl get svc
# NAME TYPE CLUSTER-IP PORT(S)
# kubernetes ClusterIP 10.96.0.1 443/TCP
# my-service ClusterIP 10.96.0.42 8080/TCP
The Service network is the most confusing because it is not real. There is no network interface with IP 10.96.0.42. There is no routing table entry for 10.96.0.0/12. When a pod sends a packet to 10.96.0.42, kube-proxy intercepts it (via iptables or IPVS rules) and rewrites the destination to the actual pod IP. The ClusterIP is a load-balancing virtual IP, nothing more.
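The translation can be modeled as a lookup plus rewrite. Here is a minimal Python sketch of the idea, using the example IPs from this section; the real work happens in kernel iptables/IPVS rules programmed by kube-proxy, not in userspace code like this:

```python
import random

# Hypothetical service table mirroring the example above: the ClusterIP
# exists only as a key in a translation table, never on an interface.
service_table = {
    ("10.96.0.42", 8080): [("10.244.0.5", 8080), ("10.244.1.8", 8080)],
}

def dnat(dst_ip, dst_port):
    """Model the DNAT step: rewrite a ClusterIP destination to a pod IP."""
    backends = service_table.get((dst_ip, dst_port))
    if backends is None:
        return (dst_ip, dst_port)   # not a ClusterIP: route normally
    return random.choice(backends)  # kube-proxy picks a real backend

# A packet addressed to the ClusterIP leaves the node addressed to a pod IP.
print(dnat("10.96.0.42", 8080))
```

If the destination is not in the table, the packet is routed unchanged, which is why only Service traffic is affected by kube-proxy.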
Mapping OSI Layers to Kubernetes
Now let us map each OSI layer to the Kubernetes component that operates there.
Layer 2/3: The CNI Plugin — Pod-to-Pod Connectivity
The CNI (Container Network Interface) plugin is responsible for:
- Assigning IP addresses to pods (L3)
- Creating virtual network interfaces (veth pairs) for pods (L2)
- Enabling pod-to-pod communication, both on the same node and across nodes
Different CNI plugins work at different layers:
| CNI Plugin | L2/L3 Approach | Overlay | eBPF | NetworkPolicy |
|---|---|---|---|---|
| Flannel | VXLAN overlay (L2 over L3) | Yes | No | No (needs Calico) |
| Calico | BGP routing (L3) or VXLAN | Optional | Yes (eBPF mode) | Yes |
| Cilium | eBPF (L3/L4) | Optional | Yes (native) | Yes (L3-L7) |
| AWS VPC CNI | Native VPC IPs (L3) | No | No | Partial |
| Weave | VXLAN overlay + mesh | Yes | No | Yes |
# Check which CNI is running
ls /etc/cni/net.d/
# 10-calico.conflist
# Check pod CIDR allocation per node
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'
# node-1 10.244.0.0/24
# node-2 10.244.1.0/24
# node-3 10.244.2.0/24
When choosing a CNI, the decision usually comes down to: (1) Do you need NetworkPolicy enforcement? If yes, skip Flannel. (2) Do you need L7 policies or service mesh features? If yes, choose Cilium. (3) Are you on AWS and want native VPC integration? Use AWS VPC CNI. (4) Do you want the simplest option that works? Calico with VXLAN.
How Pod-to-Pod Communication Works — Same Node
When Pod A (10.244.0.5) sends a packet to Pod B (10.244.0.8) on the same node, the packet never leaves the node:
1. Pod A sends the packet through its `eth0` interface (which is actually one end of a veth pair — a virtual Ethernet cable connecting the pod's network namespace to the host)
2. The packet arrives on the host side of the veth pair (e.g., `cali1234abcd` in Calico)
3. The host Linux kernel routes the packet to Pod B's veth pair based on the routing table
4. Pod B receives the packet on its `eth0` interface
# See the veth pairs on a node
ip link show | grep cali
# 5: cali1234abcd@if4: <BROADCAST,MULTICAST,UP,LOWER_UP>
# 7: cali5678efgh@if6: <BROADCAST,MULTICAST,UP,LOWER_UP>
# See the routes to pod IPs on this node
ip route | grep cali
# 10.244.0.5 dev cali1234abcd scope link
# 10.244.0.8 dev cali5678efgh scope link
How Pod-to-Pod Communication Works — Across Nodes
When Pod A (10.244.0.5 on node-1) sends a packet to Pod C (10.244.1.8 on node-2), the packet must cross the node network. How this works depends on the CNI:
VXLAN Overlay (Flannel, Calico in VXLAN mode): The pod packet is encapsulated inside a UDP packet between the nodes. This is L2-over-L3 tunneling — the entire Ethernet frame (with pod IPs) is wrapped inside a new IP packet with node IPs.
BGP Routing (Calico in BGP mode): Each node advertises its pod CIDR to other nodes via BGP. The underlying network learns that 10.244.0.0/24 is reachable via node-1 and 10.244.1.0/24 via node-2. No encapsulation needed.
VXLAN adds 50 bytes of overhead to every packet (outer Ethernet 14 + outer IP 20 + outer UDP 8 + VXLAN header 8). This reduces the effective MTU from 1500 to 1450. If your pods send 1500-byte packets, they will be fragmented or dropped. Most CNI plugins handle this automatically by setting the pod interface MTU to 1450, but misconfiguration here is a common source of mysterious connectivity failures.
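The overhead arithmetic is easy to verify, a quick sketch using the header sizes stated above:

```python
# VXLAN encapsulation overhead, byte by byte (values from the text above)
outer_ethernet = 14
outer_ip = 20
outer_udp = 8
vxlan_header = 8

overhead = outer_ethernet + outer_ip + outer_udp + vxlan_header
print(overhead)   # 50 bytes added to every packet

physical_mtu = 1500
pod_mtu = physical_mtu - overhead
print(pod_mtu)    # 1450: what the CNI should set on pod interfaces
```

If the pod interface MTU is left at 1500 while the node MTU is also 1500, the encapsulated packet exceeds the physical MTU, and the result is fragmentation or silent drops.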
Layer 4: kube-proxy — Service Load Balancing
kube-proxy operates at Layer 4. It watches the Kubernetes API for Services and Endpoints, then programs the node to redirect traffic destined for ClusterIPs to actual pod IPs.
kube-proxy has three modes:
| Mode | How It Works | Performance | Scalability |
|---|---|---|---|
| iptables | NAT rules in iptables chains | Good for small clusters | Degrades with thousands of services |
| IPVS | Linux Virtual Server (kernel L4 LB) | Better than iptables | Handles thousands of services |
| eBPF (Cilium) | Replaces kube-proxy entirely | Best | Excellent |
# See how kube-proxy translates ClusterIP to pod IPs (iptables mode)
iptables -t nat -S KUBE-SERVICES | grep my-service
# -A KUBE-SERVICES -d 10.96.0.42/32 -p tcp --dport 8080 -j KUBE-SVC-XXXX
# Follow the chain to see the actual pod endpoints
iptables -t nat -S KUBE-SVC-XXXX
# -A KUBE-SVC-XXXX -m statistic --mode random --probability 0.333 -j KUBE-SEP-AAA
# -A KUBE-SVC-XXXX -m statistic --mode random --probability 0.500 -j KUBE-SEP-BBB
# -A KUBE-SVC-XXXX -j KUBE-SEP-CCC
# Each KUBE-SEP chain DNATs to a pod IP
iptables -t nat -S KUBE-SEP-AAA
# -A KUBE-SEP-AAA -p tcp -j DNAT --to-destination 10.244.0.5:8080
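The descending probabilities (0.333, then 0.500, then unconditional) are what make the split uniform: each rule fires with probability 1/(endpoints remaining). A quick simulation (a sketch, not kube-proxy code) shows each KUBE-SEP chain receives about a third of the traffic:

```python
import random
from collections import Counter

def pick_endpoint():
    """Walk the KUBE-SVC chain the way iptables does: each rule fires
    with probability 1/(number of endpoints still remaining)."""
    if random.random() < 1 / 3:   # --probability 0.333 -> KUBE-SEP-AAA
        return "KUBE-SEP-AAA"
    if random.random() < 1 / 2:   # --probability 0.500 -> KUBE-SEP-BBB
        return "KUBE-SEP-BBB"
    return "KUBE-SEP-CCC"         # unconditional final rule

counts = Counter(pick_endpoint() for _ in range(30_000))
for sep, n in sorted(counts.items()):
    print(sep, round(n / 30_000, 2))  # each fraction close to 0.33
```

This also explains the failure mode in the anecdote below: iptables has no health awareness, so a dead endpoint keeps receiving its share of traffic until the Endpoints object changes and kube-proxy reprograms the rules.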
A team reported that their service was "randomly dropping requests." One in three requests failed. Investigation revealed one of three backend pods was in CrashLoopBackOff, but kube-proxy was still sending traffic to it because the Endpoints object had not been updated yet (readiness probe was misconfigured with too long an initial delay). The iptables rules were load-balancing to three pods, but one was dead. Fix: set proper readiness probes with short initial delays so unhealthy pods are removed from the Endpoints list quickly.
Layer 3/4: NetworkPolicy — Traffic Filtering
NetworkPolicy resources operate at Layer 3 and Layer 4. They allow or deny traffic based on:
- Source/destination pod labels (L3 — via pod IP)
- Source/destination namespaces (L3 — via pod IP)
- Ports and protocols (L4 — TCP/UDP port numbers)
# Allow only frontend pods to reach backend on port 8080
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
NetworkPolicies are additive for allow rules but default-deny requires an explicit policy. If you create a NetworkPolicy that selects a pod, all traffic not explicitly allowed is denied for that pod. But if no NetworkPolicy selects a pod, all traffic is allowed. This asymmetry confuses many engineers. To lock down a namespace, start with a default-deny policy and then add explicit allow rules.
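A default-deny policy of the kind described above is a common pattern; a sketch (the empty podSelector selects every pod in the namespace):

```yaml
# Deny all ingress to every pod in the namespace; layer allow rules on top
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}        # empty selector = all pods in this namespace
  policyTypes:
    - Ingress            # no ingress rules listed, so nothing is allowed
```

Once this exists, every pod in the namespace is isolated, and each allow policy (like the frontend-to-backend example above) punches an explicit hole.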
Layer 7: Ingress Controller — HTTP Routing
The Ingress controller operates at Layer 7. It understands HTTP and can make routing decisions based on:
- Hostname (`Host` header)
- URL path (`/api/v1` goes to one service, `/web` goes to another)
- TLS termination (decrypts HTTPS and forwards HTTP to pods)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  tls:
    - hosts:
        - devopsbeast.com
      secretName: tls-secret
  rules:
    - host: devopsbeast.com
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: api-service
                port:
                  number: 8080
          - path: /
            pathType: Prefix
            backend:
              service:
                name: frontend-service
                port:
                  number: 3000
Common Ingress controllers: NGINX Ingress, Traefik, HAProxy, AWS ALB Ingress Controller, Istio Gateway.
The newer Gateway API is replacing Ingress for advanced use cases. Gateway API provides more expressive routing (header-based routing, traffic splitting, request mirroring) and a cleaner separation between infrastructure (Gateway) and application (HTTPRoute) configuration. If you are starting a new cluster, evaluate Gateway API before defaulting to Ingress.
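For comparison, the same path-based split as the Ingress above, expressed as a Gateway API HTTPRoute. This is a sketch: it assumes a Gateway named `my-gateway` has already been provisioned by the infrastructure team.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-routes
spec:
  parentRefs:
    - name: my-gateway        # assumed Gateway, owned by the infra team
  hostnames:
    - devopsbeast.com
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api
      backendRefs:
        - name: api-service
          port: 8080
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: frontend-service
          port: 3000
```

Note the separation the text describes: TLS and listener configuration live on the Gateway, while application teams own only the HTTPRoute.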
Layer 7: Service Mesh — mTLS, Retries, Observability
Service meshes like Istio and Linkerd inject sidecar proxies into every pod. These proxies operate at Layer 7 and provide:
- mTLS: automatic encryption between all pods (zero-trust networking)
- Retries and timeouts: configurable per-route retry policies
- Traffic routing: canary deployments, A/B testing, blue-green deployments
- Observability: request-level metrics, distributed tracing
# Check if Istio sidecar is injected
kubectl get pod my-pod -o jsonpath='{.spec.containers[*].name}'
# my-app istio-proxy
# ^^^^^^^^^^^ The sidecar proxy — intercepts all traffic
# Check mTLS status between services
istioctl authn tls-check my-pod.production
Troubleshooting by Layer in Kubernetes
Here is the systematic approach when a pod cannot reach a service:
# The complete K8s network debugging sequence:
# Step 1: L3 — Pod-to-Pod IP connectivity
kubectl exec pod-a -- ping -c 3 10.244.1.8
# If fails: CNI issue
# Step 2: L4 — Port connectivity
kubectl exec pod-a -- nc -zv 10.244.1.8 8080
# If fails: NetworkPolicy or service not listening
# Step 3: L4 — Service discovery
kubectl exec pod-a -- nc -zv my-service.default.svc.cluster.local 8080
# If fails but Step 2 works: DNS or kube-proxy issue
# Step 4: L7 — Application response
kubectl exec pod-a -- curl -s http://my-service.default:8080/health
# If fails but Step 3 works: application issue
# Step 5: Check DNS specifically
kubectl exec pod-a -- nslookup my-service.default.svc.cluster.local
# If NXDOMAIN: service does not exist or wrong namespace
# If SERVFAIL: CoreDNS is broken
# Step 6: Check endpoints
kubectl get endpoints my-service
# If empty: no pods match the service selector, or pods are not ready
The most common Kubernetes networking "issue" is not a networking issue at all. It is one of: (1) DNS typo in the service name, (2) pods not ready so endpoints are empty, (3) wrong port number in the service spec, (4) missing or wrong label selector on the service. Always check kubectl get endpoints before assuming the network is broken.
How ClusterIP Actually Works
Let us demystify ClusterIP by tracing what happens when Pod A sends a request to my-service (ClusterIP 10.96.0.42, port 8080):
1. Pod A's application resolves `my-service.default.svc.cluster.local` via CoreDNS, which returns `10.96.0.42`
2. Pod A sends a TCP SYN to `10.96.0.42:8080`
3. The packet enters the host network namespace via the pod's veth pair
4. iptables/IPVS on the host matches the destination (`10.96.0.42:8080`) against the KUBE-SERVICES chain
5. The matching rule performs DNAT (Destination NAT): it rewrites the destination IP from `10.96.0.42` to a pod IP (e.g., `10.244.1.8`)
6. If the target pod is on another node, the packet is routed via the CNI (VXLAN or BGP)
7. The target pod receives a packet that appears to come from Pod A's IP (`10.244.0.5`) with destination `10.244.1.8:8080`
8. The response goes back to Pod A, and conntrack ensures the reverse translation happens correctly
# Trace this with conntrack
conntrack -L -d 10.96.0.42
# tcp 6 117 TIME_WAIT src=10.244.0.5 dst=10.96.0.42 sport=43210 dport=8080
# src=10.244.1.8 dst=10.244.0.5 sport=8080 dport=43210 [ASSURED]
# ^^^^^^^^^^^^^
# The actual destination after DNAT
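The last step is the subtle one: the backend pod replies to Pod A directly, so conntrack must rewrite the reply's source back to the ClusterIP, or Pod A would see a response from an IP it never contacted and reset the connection. A minimal model of that bookkeeping (a sketch, not kernel code):

```python
# Conntrack modeled as a dict keyed by the reply direction's 4-tuple.
# The forward entry stores the original (pre-DNAT) destination so the
# reply can be rewritten back to the ClusterIP Pod A expects.
conntrack = {}

def dnat_outbound(src, sport, cluster_ip, dport, pod_ip):
    """Record the translation when the SYN passes through DNAT."""
    conntrack[(pod_ip, dport, src, sport)] = (cluster_ip, dport)
    return (src, sport, pod_ip, dport)   # packet now targets the pod

def un_dnat_reply(src, sport, dst, dport):
    """Rewrite the reply's source back to the ClusterIP."""
    orig_ip, orig_port = conntrack[(src, sport, dst, dport)]
    return (orig_ip, orig_port, dst, dport)

# Pod A -> ClusterIP 10.96.0.42:8080, DNATed to pod 10.244.1.8:8080
fwd = dnat_outbound("10.244.0.5", 43210, "10.96.0.42", 8080, "10.244.1.8")
reply = un_dnat_reply("10.244.1.8", 8080, "10.244.0.5", 43210)
print(reply)  # ('10.96.0.42', 8080, '10.244.0.5', 43210)
```

This is also why conntrack table state matters for Services: every ClusterIP connection needs an entry for as long as the connection (and its TIME_WAIT tail) lives.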
A team had intermittent connection resets to a ClusterIP service. The issue was conntrack table exhaustion. Their cluster was handling 50,000 connections per second, and the default nf_conntrack_max was 65536. When the conntrack table filled up, new connections were dropped silently. Fix: increase nf_conntrack_max via sysctl, or migrate to Cilium eBPF mode which does not use conntrack for service routing.
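The arithmetic behind that failure is worth doing. A sketch, assuming each closed connection lingers for 120 seconds (the default nf_conntrack_tcp_timeout_time_wait; an assumption, the story above does not state it):

```python
# How fast 50k connections/second exhausts the default conntrack table.
conn_per_sec = 50_000
time_wait_secs = 120          # ASSUMED default TIME_WAIT conntrack timeout
table_max = 65_536            # the nf_conntrack_max from the story above

# Entries needed at steady state far exceed the table size:
steady_state_entries = conn_per_sec * time_wait_secs
print(steady_state_entries)   # 6,000,000 entries needed

# Starting from an empty table, it fills almost instantly:
seconds_to_fill = table_max / conn_per_sec
print(round(seconds_to_fill, 2))  # full in about 1.31 seconds
```

The table demand is roughly connection rate times entry lifetime, which is why raising nf_conntrack_max alone may not be enough, shortening TCP timeouts (or removing conntrack from the path entirely, as eBPF service routing does) attacks the other factor.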
Pod-to-Pod Across Nodes: VXLAN vs BGP
When pods on different nodes need to communicate, the CNI plugin must get the packet from one node to another. The two main approaches are fundamentally different:
| Approach | Mechanism |
|---|---|
| VXLAN Overlay | Encapsulation (L2 over L3) |
| BGP Routing | Native routing (L3) |
Key Concepts Summary
- Kubernetes has three networks: node network (physical/VPC), pod network (CNI-managed), and service network (virtual, iptables/IPVS)
- CNI plugins operate at L2/L3: they assign pod IPs, create veth pairs, and handle cross-node routing via VXLAN or BGP
- kube-proxy operates at L4: it translates ClusterIPs to pod IPs using iptables, IPVS, or eBPF
- NetworkPolicy operates at L3/L4: allows or denies traffic based on pod labels, namespaces, and ports
- Ingress controllers operate at L7: HTTP routing based on hostname and URL path, plus TLS termination
- Service meshes operate at L7: mTLS, retries, traffic routing, and observability via sidecar proxies
- ClusterIP is a virtual IP that only exists in iptables rules — no interface holds this IP
- VXLAN reduces MTU by 50 bytes — misconfigured MTU is a common source of cross-node pod connectivity issues
- Most K8s networking issues are not network issues — they are DNS typos, empty endpoints, wrong ports, or missing labels
Common Mistakes
- Assuming ClusterIP is a real IP on a real interface — it is a DNAT rule, nothing more
- Forgetting that NetworkPolicies are namespace-scoped and only enforced if the CNI supports them (Flannel does not)
- Not checking `kubectl get endpoints` when a service is unreachable — empty endpoints means no pods match the selector
- Ignoring MTU when using overlay networks — VXLAN reduces MTU by 50 bytes, causing fragmentation or dropped packets
- Using `ping` to test service connectivity — ClusterIP does not respond to ICMP by default; use `nc` or `curl` instead
- Confusing Pod DNS names with Service DNS names — `pod-ip.namespace.pod.cluster.local` exists but is rarely useful
A pod can reach another pod directly by IP (10.244.1.8:8080), but cannot reach it via the Service name (my-service:8080). What is the most likely cause?