Networking Fundamentals for DevOps Engineers

The Systematic Troubleshooting Approach

"Pod A cannot reach Pod B." That is the Slack message from the developer. No other context. No error message. No logs. Just "cannot reach."

You could spend the next two hours guessing. Is it DNS? A NetworkPolicy? A firewall rule? A bad deployment? A CNI issue? You open six terminal tabs and start running random commands.

Or you could follow a systematic, layer-by-layer approach and find the root cause in five minutes. The difference between a junior and a senior engineer is not knowing more tools; it is having a repeatable methodology that works regardless of the problem.


Part 1: Why Guessing Fails

Every experienced engineer has a story of wasting hours on a network issue that turned out to be simple. The reason is always the same: they skipped the basics and jumped to the complex.

# The guessing approach (what we all do when panicked):
# 1. "Is it DNS?" → dig looks fine
# 2. "Maybe NetworkPolicy?" → spend 20 min reviewing policies
# 3. "Is the pod even running?" → kubectl get pods, yes it is Running
# 4. "Firewall?" → check security groups for 30 minutes
# 5. "CNI issue?" → restart Cilium agent on the node
# 6. "Wait... what port is the app actually listening on?"
# 7. kubectl exec into pod → ss -tlnp → app is listening on 9090, not 8080
# 8. Fix the Service targetPort. Done. 2 hours wasted.
WAR STORY

A team escalated a P1 incident: their payment service could not reach the fraud detection service. They had already spent 90 minutes checking NetworkPolicies, restarting the CNI, and even replacing the node. The actual cause? The fraud detection service had been redeployed with a new container image that changed the listening port from 8080 to 8443 (HTTPS). Nobody updated the Kubernetes Service targetPort. A single ss -tlnp inside the pod would have revealed this in 10 seconds.
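The fix in that incident amounted to a one-line Service change. A sketch of the corrected manifest (all names here are hypothetical, not taken from the actual incident):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: fraud-detection        # hypothetical service name
  namespace: payments
spec:
  selector:
    app: fraud-detection
  ports:
    - name: https
      port: 443                # the port clients connect to
      targetPort: 8443         # must match what the container actually listens on
```

The lesson generalizes: whenever a new image changes the port an app listens on, the Service targetPort must change with it.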

The systematic approach works because network issues always exist at a specific layer. By checking layers in order — from the bottom up — you find the problem at the layer where it actually is, without wasting time on layers that are working fine.


Part 2: The 5-Step Network Debug Methodology

Every network issue falls into one of five categories. Check them in order. Stop at the first failure.
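The five checks can be sketched as a single shell function that stops at the first failing layer. This is a sketch, not a production script: it assumes a Linux shell with ping, curl, and coreutils' timeout available, and the /health path is an assumption about the target app.

```shell
# Layer-by-layer probe. Stop at the first failure; that is your layer.
layer_check() {
  local ip=$1 port=$2

  # L3: can packets reach the host? ICMP may be filtered, so warn but continue.
  ping -c 1 -W 2 "$ip" >/dev/null 2>&1 \
    || echo "L3: no ICMP reply (network down, or ICMP just filtered)"

  # L4: does the port accept TCP? bash's built-in /dev/tcp avoids needing netcat.
  if ! timeout 3 bash -c "exec 3<>/dev/tcp/$ip/$port" 2>/dev/null; then
    echo "L4 FAIL: no TCP connection to $ip:$port (refused or dropped)"
    return 1
  fi

  # L7: does the app answer HTTP? (/health path is a guess; adjust to your app)
  if ! curl -fsS --max-time 5 "http://$ip:$port/health" >/dev/null 2>&1; then
    echo "L7 FAIL: TCP connects but HTTP health check is unhealthy"
    return 1
  fi
  echo "all layers OK"
}

# Example: layer_check 10.244.2.8 8080
```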


Step 1: Verify the Symptom

This is the step everyone skips, and it is the most important. "Cannot reach" could mean five different things, and each one has a different root cause.

# Ask the developer (or figure out yourself):
# What EXACTLY happens when you try to connect?

# Symptom: Connection timeout (no response)
curl -v --connect-timeout 5 http://pod-b:8080
# * Trying 10.244.2.8:8080...
# * Connection timed out after 5001 milliseconds
# Cause: Firewall/NetworkPolicy dropping packets, wrong IP, routing issue

# Symptom: Connection refused (immediate RST)
curl -v http://pod-b:8080
# * Trying 10.244.2.8:8080...
# * connect to 10.244.2.8 port 8080 failed: Connection refused
# Cause: Nothing listening on that port (app crashed, wrong port)

# Symptom: DNS resolution failure
curl -v http://pod-b-service:8080
# * Could not resolve host: pod-b-service
# Cause: Service does not exist, wrong namespace, CoreDNS issue

# Symptom: HTTP error
curl -v http://pod-b:8080/api/health
# < HTTP/1.1 503 Service Unavailable
# Cause: App is running but unhealthy (database down, config error)

# Symptom: TLS error
curl -v https://pod-b:8443
# * SSL: certificate subject name does not match target host name
# Cause: Certificate mismatch, expired cert, wrong TLS config
KEY CONCEPT

The symptom tells you which layer to investigate. Timeout = L3/L4 (network/transport). Connection refused = L4 (transport — port not open). DNS failure = DNS. HTTP error = L7 (application). TLS error = L6 (presentation/TLS). If you correctly identify the symptom, you skip directly to the right layer and save massive amounts of time.
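curl's exit code already encodes most of these symptom classes, so a single probe can route you to the right layer. A sketch (pod-b:8080 is a placeholder target; the exit-code meanings are from curl's documentation):

```shell
# Map curl's exit code to the layer to investigate:
# 6 = DNS, 7 = connection refused, 28 = timeout, 35/60 = TLS.
curl -sS --connect-timeout 5 -o /dev/null "http://pod-b:8080/" 2>/dev/null
case $? in
  0)     echo "connected - inspect the HTTP status (L7)" ;;
  6)     echo "DNS: could not resolve host" ;;
  7)     echo "L4: connection refused - nothing listening on that port" ;;
  28)    echo "L3/L4: timeout - packets are being dropped" ;;
  35|60) echo "TLS: handshake or certificate problem" ;;
  *)     echo "other curl error - rerun with -v for detail" ;;
esac
```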

Step 2: Check L3 — Network Layer

Can packets reach the destination at all?

# Ping the destination IP (not hostname — isolate DNS)
ping -c 3 10.244.2.8
# PING 10.244.2.8 (10.244.2.8): 56 data bytes
# 64 bytes from 10.244.2.8: icmp_seq=0 ttl=62 time=0.543 ms
# → L3 is working. Move to L4.

# If ping fails:
ping -c 3 10.244.2.8
# PING 10.244.2.8 (10.244.2.8): 56 data bytes
# Request timeout for icmp_seq 0
# → Possible causes:
#   1. Destination IP does not exist (pod was deleted)
#   2. Routing issue (no route to 10.244.x.x subnet)
#   3. Firewall/NetworkPolicy blocking ICMP
#   4. CNI issue (pod network not functioning)

# Check routing
ip route get 10.244.2.8
# 10.244.2.8 via 10.0.1.1 dev eth0 src 10.244.1.5
# → Route exists. If "unreachable" → no route to destination.

# Traceroute to see where packets stop
traceroute -n 10.244.2.8
# 1  10.244.1.1   0.5ms   # Node gateway
# 2  10.0.1.1     1.2ms   # Cluster network
# 3  * * *                 # Packets dropped here
# → The problem is between hop 2 and the destination
WARNING

Ping uses ICMP, and many environments block ICMP. A failed ping does NOT always mean L3 is broken — it might just mean ICMP is filtered. If ping fails, try a TCP-based check at L4 before concluding the network is down. In Kubernetes, most CNIs allow ICMP between pods by default, but NetworkPolicies can block it.
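When ICMP is filtered, fall back to a TCP probe before concluding anything. bash's built-in /dev/tcp works even in minimal images with no netcat installed (the IP and port below are placeholders):

```shell
# TCP probe with no extra tools: bash itself opens /dev/tcp/<ip>/<port>.
timeout 3 bash -c 'exec 3<>/dev/tcp/10.244.2.8/8080' 2>/dev/null \
  && echo "TCP works - L3 is fine, ICMP was just filtered" \
  || echo "TCP also fails - now suspect routing, policy, or the port"
```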

Step 3: Check L4 — Transport Layer

Can you establish a TCP connection to the specific port?

# Test TCP connectivity with netcat
nc -zv 10.244.2.8 8080
# Connection to 10.244.2.8 8080 port [tcp/*] succeeded!
# → L4 is working. Port is open and accepting connections. Move to L7.

# If connection refused:
nc -zv 10.244.2.8 8080
# nc: connect to 10.244.2.8 port 8080 (tcp) failed: Connection refused
# → Nothing is listening on port 8080. Check:
#   1. Is the app actually running? (kubectl exec pod -- ps aux)
#   2. What port is the app listening on? (kubectl exec pod -- ss -tlnp)
#   3. Is the app binding to 0.0.0.0 or 127.0.0.1?

# If timeout:
nc -zv -w 5 10.244.2.8 8080
# nc: connect to 10.244.2.8 port 8080 (tcp) failed: Operation timed out
# → Port might be firewalled. Check:
#   1. NetworkPolicy blocking the traffic
#   2. Cloud security group rules
#   3. Host-level iptables rules
PRO TIP

A common gotcha: the application binds to 127.0.0.1 (localhost) instead of 0.0.0.0 (all interfaces). When bound to localhost, the app only accepts connections from within the same network namespace. Connections from outside the pod are refused. Check with ss -tlnp and look at the Local Address column. If it shows 127.0.0.1:8080 instead of *:8080 or 0.0.0.0:8080, that is the problem.
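The gotcha is easy to reproduce locally, with python3's built-in HTTP server standing in for the misconfigured app (assumes a Linux host; port 9090 is arbitrary):

```shell
# Start a server bound to loopback only, like a misconfigured app:
python3 -m http.server 9090 --bind 127.0.0.1 >/dev/null 2>&1 &
srv=$!; sleep 1

# From the same network namespace: works.
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:9090/
# prints 200

# Via the host's real interface - what another pod would see: refused.
ip=$(hostname -I | awk '{print $1}')
curl -s -o /dev/null http://"$ip":9090/ || echo "refused via $ip"

kill $srv 2>/dev/null || true
```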

Step 4: Check L7 — Application Layer

The TCP connection succeeds. Now, does the application respond correctly?

# Test HTTP response
curl -v http://10.244.2.8:8080/health
# * Connected to 10.244.2.8 (10.244.2.8) port 8080
# > GET /health HTTP/1.1
# > Host: 10.244.2.8:8080
# >
# < HTTP/1.1 200 OK
# < Content-Type: application/json
# {"status":"healthy"}
# → L7 is working. The issue is elsewhere.

# If the app returns an error:
curl -v http://10.244.2.8:8080/health
# < HTTP/1.1 503 Service Unavailable
# {"error":"database connection failed"}
# → The network is fine. The app has an internal error.
#   This is NOT a networking problem. Escalate to the app team.

Step 5: Check Side Effects

If basic connectivity tests pass but real traffic fails, check these common side effects:

# DNS: Does the hostname resolve correctly?
dig api-service.production.svc.cluster.local +short
# 10.96.45.123
# If empty: Service does not exist, or CoreDNS is not resolving

# TLS: Is the certificate valid?
openssl s_client -connect 10.244.2.8:8443 -servername api.example.com </dev/null 2>&1 | head -20
# Verify return code: 0 (ok)
# If non-zero: cert expired, wrong hostname, untrusted CA

# NetworkPolicy: Is traffic explicitly denied?
kubectl get networkpolicy -n production
kubectl describe networkpolicy -n production

# MTU: Are large packets being dropped?
ping -c 3 -s 1472 -M do 10.244.2.8
# If "Frag needed and DF set" → MTU mismatch between source and destination
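NetworkPolicies deny by omission: once any policy selects a pod, all ingress not explicitly allowed is dropped, which shows up as a connection timeout. A minimal allow rule looks like this (names and labels are hypothetical):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-payment     # hypothetical
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: fraud-detection     # the pods this policy protects
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: payment     # only these pods may connect
      ports:
        - protocol: TCP
          port: 8443
```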

Part 3: The Kubernetes-Specific Debug Flow

In Kubernetes, the generic 5-step methodology gets a K8s-specific overlay. Before checking L3/L4/L7, verify the Kubernetes objects are correct.

# K8s Debug Flow:
# 1. Is the pod running?
kubectl get pod -n production -l app=api
# STATUS: Running? CrashLoopBackOff? Pending? ImagePullBackOff?

# 2. Is the pod ready?
kubectl get pod -n production -l app=api
# READY: 1/1? 0/1 means readiness probe is failing

# 3. Do Endpoints exist?
kubectl get endpoints api-service -n production
# If empty: selector mismatch or no Ready pods

# 4. Does DNS resolve?
kubectl exec debug-pod -- dig api-service.production.svc.cluster.local +short
# Should return the ClusterIP

# 5. Is the port reachable via the Service?
kubectl exec debug-pod -- curl -v http://api-service.production:80

# 6. Is the port reachable directly on the pod?
kubectl exec debug-pod -- curl -v http://10.244.2.8:8080
# If Step 6 works but Step 5 fails: issue is in kube-proxy/iptables

# 7. Does the app respond correctly?
kubectl exec debug-pod -- curl -v http://10.244.2.8:8080/health



Part 4: The Debug Pod — Your Portable Toolbox

Most application containers are minimal — they have no networking tools. You cannot run curl, dig, ping, or traceroute from inside them. The solution is a debug pod with all networking tools pre-installed.

# The standard debug pod (nicolaka/netshoot)
kubectl run debug --image=nicolaka/netshoot -it --rm -- bash

# What netshoot includes:
# curl, wget, ping, traceroute, mtr, dig, nslookup, host
# nc (netcat), nmap, ss, ip, iptables, tcpdump
# openssl, iperf3, ethtool, strace

# Run it in a specific namespace to test Service DNS:
kubectl run debug -n production --image=nicolaka/netshoot -it --rm -- bash

# Run it on a specific node (to test node-level networking):
kubectl run debug --image=nicolaka/netshoot -it --rm \
  --overrides='{"spec":{"nodeSelector":{"kubernetes.io/hostname":"node-1"}}}' \
  -- bash
KEY CONCEPT

Always run the debug pod in the SAME NAMESPACE as the Service you are debugging. Kubernetes DNS resolves short names (api-service) only within the same namespace. From a different namespace, you must use the full name: api-service.production.svc.cluster.local. Running the debug pod in the wrong namespace is a common source of "DNS does not resolve" false alarms.

Debugging Without a Debug Pod

Sometimes you cannot create new pods (restrictive RBAC, Pod Security admission policies). Alternative approaches:

# Use kubectl exec into an existing pod (if it has a shell)
kubectl exec -it existing-pod -n production -- /bin/sh

# Use ephemeral containers (Kubernetes 1.25+)
kubectl debug -it existing-pod -n production \
  --image=nicolaka/netshoot --target=app-container

# Use nsenter from the node (requires SSH to node)
# Find the pod PID:
crictl ps | grep my-pod
crictl inspect <container-id> | grep pid
# Enter the pod network namespace:
nsenter -t <pid> -n -- curl -v http://10.244.2.8:8080
PRO TIP

Ephemeral containers (kubectl debug) are the cleanest way to debug without modifying the pod spec. They share the network namespace of the target pod, so you can test connectivity from exactly the same network context. This is especially useful when NetworkPolicies are involved — the debug pod needs to be in the same network context as the pod experiencing the issue.


Part 5: When to Escalate

Some network issues are beyond application-level debugging. Know when to escalate:

Escalate to the CNI/networking team when:

  • Multiple pods across different nodes cannot communicate (cluster-wide)
  • ip route shows missing or incorrect routes for pod CIDRs
  • CNI agent pods (Cilium, Calico, Flannel) are crashing or not running
  • Node-to-node connectivity is broken (ping between node IPs fails)

Escalate to the cloud/infrastructure team when:

  • Security group rules are blocking expected traffic
  • VPC peering or Transit Gateway routes are missing
  • Load balancer health checks are failing at the cloud level
  • DNS resolution fails for external domains (not just cluster DNS)

Escalate to the kernel/OS team when:

  • dmesg shows networking errors (nf_conntrack table full, etc.)
  • conntrack -S shows dropped connections
  • MTU issues persist despite correct configuration
  • iptables rules are not being applied correctly
# Quick checks before escalating:

# Is conntrack table full? (causes random connection drops)
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
# If count is close to max: increase max or investigate connection leaks

# Are there dropped packets at the kernel level?
cat /proc/net/dev
# Look at drop column for the pod interface

# Is the CNI agent running?
kubectl get pods -n kube-system -l k8s-app=cilium  # or calico-node, etc.
WAR STORY

We had intermittent connection drops affecting 0.1% of requests across the entire cluster. Random pods, random services, no pattern. After two days of debugging at L4/L7, we checked the conntrack table: it was at 95% capacity. The default nf_conntrack_max was 131072; with our pod count and connection rate we needed 524288. Quadrupling the conntrack max immediately fixed the issue. Always check system-level limits when the problem is intermittent and cluster-wide.
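The corresponding fix is a one-line sysctl on every node. A sketch (the path is illustrative, and the right value depends on your pod count and connection rate):

```ini
# /etc/sysctl.d/99-conntrack.conf (filename illustrative)
# 524288 = 4x the common 131072 default; size for YOUR connection rate.
net.netfilter.nf_conntrack_max = 524288
```

Apply it with sysctl --system (or a node reboot); how you roll it out across nodes depends on your node management tooling.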


Key Concepts Summary

  • The 5-step methodology: Verify symptom, Check L3 (ping/traceroute), Check L4 (nc/telnet), Check L7 (curl), Check side effects (DNS/TLS/NetworkPolicy)
  • The symptom tells you the layer: timeout = L3/L4, connection refused = L4, DNS failure = DNS, HTTP error = L7, TLS error = TLS
  • In Kubernetes, check objects first: Pod Running? Pod Ready? Endpoints exist? DNS resolves? Port reachable via Service?
  • Empty Endpoints is the #1 cause of Service connectivity failures — always check selector labels
  • Use nicolaka/netshoot as your debug pod — it has every networking tool pre-installed
  • Run debug pods in the same namespace as the Service you are debugging — DNS short names only resolve within the same namespace
  • Ephemeral containers (kubectl debug) let you debug without creating new pods
  • App binding to 127.0.0.1 instead of 0.0.0.0 is a silent killer — check with ss -tlnp
  • Know when to escalate: cluster-wide issues, CNI agent crashes, conntrack table full, node-to-node failures

Common Mistakes

  • Jumping to complex explanations (CNI bug, kernel issue) before checking the basics (is the pod running? is the port correct?)
  • Running debug pods in the wrong namespace and concluding DNS is broken
  • Assuming a failed ping means the network is down — ICMP may be blocked while TCP works fine
  • Not distinguishing between timeout (packets dropped) and connection refused (port not open) — they have completely different root causes
  • Forgetting to check if the application binds to 0.0.0.0 vs 127.0.0.1 — localhost binding silently rejects all external connections
  • Ignoring the Endpoints object — spending hours debugging the network when the Service selector simply does not match the pod labels
  • Not checking conntrack table limits during intermittent cluster-wide connection drops

KNOWLEDGE CHECK

You run nc -zv 10.244.2.8 8080 from a debug pod and get 'Connection refused'. What does this tell you?