Networking Fundamentals for DevOps Engineers

Essential Network Commands

You SSH into a production Kubernetes node at 2 AM. An incident is in progress. The monitoring dashboard shows elevated 5xx errors. Developers say "the service is down." Your manager is in the incident channel asking for updates every 3 minutes.

You have no IDE, no fancy GUI tools, no time to read man pages. You need to diagnose the problem with command-line tools you have memorized. The seven commands in this lesson are the tools that senior engineers reach for instinctively. Learn them once, and you will use them for the rest of your career.


Command 1: curl — The Swiss Army Knife of HTTP

curl makes HTTP requests from the command line. It is the single most important debugging tool for any engineer working with web services.

The Essential Flags

# -v: Verbose output (shows headers, TLS handshake, timing)
# This is the flag you use 90% of the time
curl -v http://api-service:8080/health
# * Trying 10.96.45.123:8080...
# * Connected to api-service (10.96.45.123) port 8080
# > GET /health HTTP/1.1
# > Host: api-service:8080
# > User-Agent: curl/7.88.1
# > Accept: */*
# >
# < HTTP/1.1 200 OK
# < Content-Type: application/json
# < Content-Length: 20
# <
# {"status":"healthy"}

# -I: HEAD request (headers only, no body)
# Quick check: is the service responding? What status code?
curl -I http://api-service:8080/health
# HTTP/1.1 200 OK
# Content-Type: application/json
# Content-Length: 20

# -o /dev/null -s -w: Custom output format
# Extract specific information (status code, timing)
curl -o /dev/null -s -w "HTTP %{http_code} | Total: %{time_total}s | Connect: %{time_connect}s | TTFB: %{time_starttransfer}s\n" \
  http://api-service:8080/health
# HTTP 200 | Total: 0.045s | Connect: 0.002s | TTFB: 0.043s
KEY CONCEPT

The -w (write-out) flag is incredibly powerful for diagnosing latency issues. time_connect shows TCP handshake time (network latency). time_starttransfer shows Time To First Byte (TTFB — includes server processing time). If time_connect is high, the network is slow. If time_starttransfer is high but time_connect is low, the server is slow. This single command answers "is it the network or the app?"
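That rule of thumb can be scripted. A minimal sketch using hardcoded sample values for the two timings (in practice, capture them with curl -w "%{time_connect} %{time_starttransfer}"); the 100 ms and 500 ms thresholds here are illustrative, not a standard:

```shell
# Classify latency from curl's -w timings (sample values; replace with real measurements)
time_connect=0.002        # TCP handshake time
time_starttransfer=0.950  # time to first byte (includes server processing)

awk -v c="$time_connect" -v t="$time_starttransfer" 'BEGIN {
  server = t - c                                  # rough server-side share of TTFB
  if (c > 0.1)           print "network slow: connect took " c "s"
  else if (server > 0.5) print "server slow: processing took " server "s"
  else                   print "network and server both look fast"
}'
# → server slow: processing took 0.948s
```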

Practical Scenarios

# Scenario 1: Test an endpoint with a specific Host header
# (Useful when DNS does not resolve but you know the IP)
curl -v -H "Host: api.example.com" http://10.96.45.123:80/health

# Scenario 2: Skip TLS verification (self-signed certs in dev/staging)
curl -vk https://api-service:8443/health

# Scenario 3: Set a connection timeout (do not hang forever)
curl -v --connect-timeout 5 --max-time 10 http://api-service:8080/health
# --connect-timeout: max seconds for TCP handshake
# --max-time: max seconds for the entire request

# Scenario 4: Force resolution to a specific IP
# (Test a specific pod without changing DNS)
curl -v --resolve api.example.com:443:10.244.1.5 https://api.example.com/health

# Scenario 5: Send a POST request with JSON body
curl -v -X POST -H "Content-Type: application/json" \
  -d '{"key": "value"}' \
  http://api-service:8080/api/data

# Scenario 6: Follow redirects
curl -vL http://api-service:8080/old-path
# -L follows 301/302 redirects automatically
PRO TIP

Use --resolve instead of editing /etc/hosts when you need to test a specific backend IP with a hostname. It works for a single curl request without affecting the rest of the system. This is perfect for testing whether a specific pod is healthy without going through the load balancer.
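The timeouts from scenario 3 combine naturally with a retry loop in health-check scripts. A sketch: the URL is a placeholder (port 9 is chosen precisely because nothing should be listening on it), and the retry count is arbitrary:

```shell
# Bounded retries with timeouts; curl exit 7 = connection refused, 28 = timed out
url="http://127.0.0.1:9/health"   # placeholder URL: nothing listens on port 9
for attempt in 1 2 3; do
  rc=0
  curl -fsS --connect-timeout 2 --max-time 5 "$url" >/dev/null 2>&1 || rc=$?
  if [ "$rc" -eq 0 ]; then
    echo "healthy on attempt $attempt"
    break
  fi
  echo "attempt $attempt failed (curl exit $rc)"
done
# → attempt 1 failed (curl exit 7)
# → attempt 2 failed (curl exit 7)
# → attempt 3 failed (curl exit 7)
```

Checking the exit code, not just the output, matters: -f makes curl return a failure for HTTP errors too, so the loop distinguishes "unreachable" from "reachable but unhealthy".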


Command 2: dig — DNS Query Tool

dig (Domain Information Groper) queries DNS servers directly. It shows you exactly what DNS returns, including TTL, record type, and which server answered.

The Essential Patterns

# Basic query: what IP does this hostname resolve to?
dig api-service.production.svc.cluster.local
# ;; ANSWER SECTION:
# api-service.production.svc.cluster.local. 30 IN A 10.96.45.123

# +short: just the answer, no noise
dig api-service.production.svc.cluster.local +short
# 10.96.45.123

# Query a specific DNS server (CoreDNS in K8s)
dig @10.96.0.10 api-service.production.svc.cluster.local +short
# 10.96.45.123
# (@10.96.0.10 is typically the CoreDNS ClusterIP)

# Query specific record types
dig api.example.com A        # IPv4 address
dig api.example.com AAAA     # IPv6 address
dig api.example.com CNAME    # Canonical name (alias)
dig api.example.com MX       # Mail exchange
dig api.example.com TXT      # Text records (SPF, DKIM, etc.)
dig api.example.com NS       # Name servers
dig api.example.com SRV      # Service records (used by K8s headless Services)

DNS Debugging in Kubernetes

# Check if CoreDNS is resolving cluster-internal names
dig @10.96.0.10 kubernetes.default.svc.cluster.local +short
# 10.96.0.1  (the API server ClusterIP — always exists)

# Check a headless Service (returns pod IPs, not ClusterIP)
dig @10.96.0.10 database.production.svc.cluster.local +short
# 10.244.1.5
# 10.244.2.8
# 10.244.3.2

# Check SRV records for a headless Service (includes port)
# SRV names follow _<port-name>._<protocol>.<service>.<namespace>.svc.cluster.local
# (this example assumes the Service's port is named "postgres")
dig @10.96.0.10 _postgres._tcp.database.production.svc.cluster.local SRV +short
# 0 33 5432 database-0.database.production.svc.cluster.local.
# 0 33 5432 database-1.database.production.svc.cluster.local.
# 0 33 5432 database-2.database.production.svc.cluster.local.

# Check external DNS resolution (goes through CoreDNS upstream)
dig @10.96.0.10 google.com +short
# 142.250.80.46
# If this fails: CoreDNS cannot reach upstream DNS servers
WARNING

When debugging DNS in Kubernetes, always use dig @10.96.0.10 (or whatever your CoreDNS ClusterIP is) to explicitly query the cluster DNS server. If you just run dig hostname, it uses /etc/resolv.conf, which inside a pod points to CoreDNS, but on a node might point to the cloud DNS (169.254.169.253 on AWS). This distinction matters when the problem is specific to cluster DNS.
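To see which resolver a bare dig will use, inspect /etc/resolv.conf. The sketch below parses a captured sample of a typical pod's file so it is self-contained; the values (including the ndots:5 option Kubernetes sets) will differ in your cluster:

```shell
# A typical pod's /etc/resolv.conf (captured sample; inspect yours with: cat /etc/resolv.conf)
resolv_conf='nameserver 10.96.0.10
search production.svc.cluster.local svc.cluster.local cluster.local
options ndots:5'

# The first nameserver line is where a bare `dig hostname` sends its query
resolver=$(printf '%s\n' "$resolv_conf" | awk '/^nameserver/ {print $2; exit}')
echo "default queries go to: $resolver"
# → default queries go to: 10.96.0.10
```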

The +trace Flag — Follow the Full Resolution Path

# Trace the full DNS resolution path from root servers down
dig +trace api.example.com
# .                       518400  IN  NS  a.root-servers.net.
# com.                    172800  IN  NS  a.gtld-servers.net.
# example.com.            172800  IN  NS  ns1.example.com.
# api.example.com.        300     IN  A   203.0.113.50
#
# This shows every step of DNS resolution:
# Root → .com TLD → example.com authoritative → final answer
# If resolution fails, +trace shows exactly WHERE it fails

DNS Resolution Path (dig +trace)



Command 3: nslookup — Simpler DNS Lookup

nslookup is a simpler alternative to dig. It does not show as much detail but is available on more systems (including many minimal container images).

# Basic lookup
nslookup api-service.production.svc.cluster.local
# Server:    10.96.0.10
# Address:   10.96.0.10#53
#
# Name:   api-service.production.svc.cluster.local
# Address: 10.96.45.123

# Query a specific DNS server
nslookup api-service.production.svc.cluster.local 10.96.0.10

# Reverse lookup (IP to hostname)
nslookup 10.96.45.123
# 123.45.96.10.in-addr.arpa  name = api-service.production.svc.cluster.local
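The in-addr.arpa name in that reverse lookup is just the IP's octets in reverse order, which you can construct yourself:

```shell
# Build the PTR (reverse-lookup) query name for an IPv4 address
ip=10.96.45.123
ptr=$(echo "$ip" | awk -F. '{print $4"."$3"."$2"."$1".in-addr.arpa"}')
echo "$ptr"
# → 123.45.96.10.in-addr.arpa
```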
PRO TIP

Use dig when you need detailed DNS debugging (TTL, authoritative server, trace). Use nslookup when you just need a quick "does this hostname resolve?" check and dig is not installed. In Kubernetes debug pods (netshoot), both are available — prefer dig for its richer output.


Command 4: nc (netcat) — TCP/UDP Connectivity Tester

Netcat is the simplest way to test whether you can establish a TCP connection to a specific host and port. It answers the question: "Is the port open?"

# Test TCP connectivity (-z: scan mode, -v: verbose)
# The same command can return three different results; each points to a different cause:
nc -zv api-service 8080
# Connection to api-service 8080 port [tcp/*] succeeded!
# → Port is open, something is listening

nc -zv api-service 8080
# nc: connect to api-service port 8080 (tcp) failed: Connection refused
# → Port is NOT open (nothing listening, or wrong port)

nc -zv api-service 8080
# nc: connect to api-service port 8080 (tcp) failed: Operation timed out
# → Port is firewalled (packets dropped, no response)

# Test with a timeout (do not hang forever)
nc -zv -w 3 api-service 8080
# -w 3: timeout after 3 seconds

# Scan a range of ports
nc -zv api-service 8080-8090
# Connection to api-service 8080 port [tcp/*] succeeded!
# Connection to api-service 8081 port [tcp/*] succeeded!
# nc: connect to api-service port 8082 (tcp) failed: Connection refused
# ...

# Test UDP connectivity (less reliable — UDP has no handshake)
nc -zuv api-service 53
# Connection to api-service 53 port [udp/domain] succeeded!
KEY CONCEPT

The three results from nc -zv each mean something different. succeeded = port open, app is listening. Connection refused = host reachable but nothing on that port (RST packet received). Operation timed out = packets are being dropped (firewall, NetworkPolicy). Learn to distinguish these — they point to completely different root causes.
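In minimal images where nc itself is missing, bash's /dev/tcp pseudo-device is a rough substitute. A sketch (this is a bash feature, not POSIX sh; it uses coreutils timeout, and unlike nc's error messages it cannot distinguish refused from filtered):

```shell
# TCP port check without netcat, using bash's built-in /dev/tcp
check_port() {
  host=$1; port=$2
  if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "$host:$port open"
  else
    echo "$host:$port closed or filtered"
  fi
}
check_port 127.0.0.1 1   # port 1: almost certainly nothing listening
# → 127.0.0.1:1 closed or filtered
```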

netcat as a Lightweight Server

# Start a listener on port 9090 (useful for testing connectivity)
nc -l -p 9090      # traditional/GNU netcat
# nc -l 9090       # OpenBSD netcat takes the port directly, without -p

# On another pod, test sending data:
echo "hello" | nc -w 1 debug-pod 9090
# If "hello" appears on the listener, end-to-end connectivity works

# This is useful for testing NetworkPolicies:
# 1. Start a listener in the destination pod
# 2. Send data from the source pod
# 3. If data arrives, the NetworkPolicy allows the traffic

Command 5: ss — Socket Statistics

ss shows network connections, listening ports, and socket statistics. It replaces the older netstat command and is faster on systems with many connections.

# Show all listening TCP ports with process names
ss -tlnp
# State    Recv-Q  Send-Q  Local Address:Port  Peer Address:Port  Process
# LISTEN   0       128     0.0.0.0:8080        0.0.0.0:*          users:(("java",pid=1,fd=5))
# LISTEN   0       128     127.0.0.1:9090      0.0.0.0:*          users:(("java",pid=1,fd=8))
#
# Flags: -t (TCP), -l (listening), -n (numeric), -p (process)

# CRITICAL INSIGHT from the output above:
# Port 8080 listens on 0.0.0.0 → accessible from ANY IP (correct)
# Port 9090 listens on 127.0.0.1 → accessible ONLY from localhost (problem if external access needed)
WARNING

When ss -tlnp shows a service listening on 127.0.0.1:PORT, it is only reachable from within the same pod/container. External connections (from other pods, from the Service) will get "Connection refused." This is the #1 silent killer in container networking. The fix is in the application config — change the bind address to 0.0.0.0.
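A quick audit for loopback-only listeners can be scripted against ss output. The sketch below runs on a captured sample so it is self-contained; in a live pod, pipe `ss -tln` straight into the awk:

```shell
# Flag listeners bound only to 127.0.0.1 (sample ss -tln output; live: ss -tln | awk ...)
ss_out='State  Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0      128    0.0.0.0:8080       0.0.0.0:*
LISTEN 0      128    127.0.0.1:9090     0.0.0.0:*'

printf '%s\n' "$ss_out" | awk 'NR > 1 && $4 ~ /^127\.0\.0\.1:/ {
  print "loopback-only listener: " $4 " (unreachable from other pods)"
}'
# → loopback-only listener: 127.0.0.1:9090 (unreachable from other pods)
```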

Practical ss Patterns

# Show established connections to a specific port
ss -tnp state established '( dport = :5432 )'
# Shows all connections to PostgreSQL — useful for seeing connection count

# Count connections by state (-H suppresses the header line so wc -l is accurate)
ss -H -t state established | wc -l
# 245  ← 245 established TCP connections

ss -H -t state time-wait | wc -l
# 1203 ← 1203 connections in TIME_WAIT (high number = connection churn)

# Show all connections to a specific remote IP
ss -tnp dst 10.244.2.8
# Shows all connections from this host to a specific pod

# Show socket memory usage (debug buffer issues)
ss -tm
# Shows send/receive buffer sizes per connection

# Show summary statistics
ss -s
# Total: 534 (kernel 612)
# TCP:   402 (estab 245, closed 89, orphaned 2, synrecv 0, timewait 68/0)
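Rather than one wc -l per state, a single awk pass can tally every state at once. Sketch over a captured sample so it is self-contained; on a live host, pipe `ss -tan` into the same awk:

```shell
# Tally TCP connections by state in one pass (sample data; live: ss -tan | awk 'NR>1 {c[$1]++} ...')
sample='State     Recv-Q Send-Q Local Address:Port  Peer Address:Port
ESTAB     0      0      10.0.0.1:443        10.0.0.2:51000
ESTAB     0      0      10.0.0.1:443        10.0.0.3:51001
TIME-WAIT 0      0      10.0.0.1:443        10.0.0.4:51002
ESTAB     0      0      10.0.0.1:443        10.0.0.5:51003'

printf '%s\n' "$sample" | awk 'NR > 1 {count[$1]++} END {for (s in count) print s, count[s]}' | sort
# → ESTAB 3
# → TIME-WAIT 1
```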

When to Use Which Command

Quick Connectivity Checks

Is the target reachable?

  • HTTP endpoint working?        →  curl -v http://host:port/path
  • DNS resolving?                →  dig hostname +short
  • Port open?                    →  nc -zv host port
  • Listening ports?              →  ss -tlnp
  • Network path?                 →  traceroute -n host

Deep Investigation

What exactly is happening?

  • Full HTTP exchange?           →  curl -v (shows headers, TLS)
  • DNS resolution path?          →  dig +trace hostname
  • Connection states?            →  ss -tnp state established
  • Packet-level analysis?        →  tcpdump -i any port 80
  • Continuous path monitoring?   →  mtr host

Command 6: traceroute & mtr — Path Tracing

traceroute shows the network path (hops) between your machine and the destination. Each hop is a router that forwards your packet.

# Basic traceroute
traceroute -n 10.244.2.8
# traceroute to 10.244.2.8 (10.244.2.8), 30 hops max, 60 byte packets
#  1  10.244.1.1     0.5 ms   0.4 ms   0.5 ms    # Pod gateway
#  2  10.0.1.1       1.2 ms   1.1 ms   1.3 ms    # Node network
#  3  10.0.2.1       1.8 ms   1.7 ms   1.9 ms    # Destination node
#  4  10.244.2.8     2.1 ms   2.0 ms   2.2 ms    # Destination pod
#
# Flags: -n (numeric — do not resolve hostnames, much faster)

# TCP traceroute (bypasses ICMP firewalls)
traceroute -T -p 8080 10.244.2.8
# Uses TCP SYN instead of ICMP — more likely to succeed in
# environments that block ICMP but allow TCP

# When hops show * * *:
traceroute -n 10.244.2.8
#  1  10.244.1.1     0.5 ms   0.4 ms   0.5 ms
#  2  10.0.1.1       1.2 ms   1.1 ms   1.3 ms
#  3  * * *                                       # ← This hop drops/blocks
#  4  * * *                                       #    traceroute packets
# This does NOT necessarily mean the network is broken.
# Many routers drop ICMP/UDP traceroute packets by policy.
# Try -T (TCP) mode, or test with nc -zv instead.

mtr — Continuous traceroute

mtr combines traceroute and ping into a continuous monitoring tool. It sends packets repeatedly and shows real-time statistics per hop.

# Run mtr (interactive mode)
mtr -n 10.244.2.8
# HOST                   Loss%  Snt   Last  Avg   Best  Wrst  StDev
# 1. 10.244.1.1          0.0%   50    0.5   0.5   0.3   1.2   0.1
# 2. 10.0.1.1            0.0%   50    1.2   1.1   0.8   2.1   0.2
# 3. 10.0.2.1            2.0%   50    1.8   1.7   1.4   3.5   0.4   ← 2% loss
# 4. 10.244.2.8          2.0%   50    2.1   2.0   1.7   3.8   0.3

# Report mode (run 100 packets and print summary)
mtr -n --report -c 100 10.244.2.8

# The columns:
# Loss% = packet loss at this hop
# Snt   = packets sent
# Last  = last round-trip time
# Avg   = average round-trip time
# Best  = minimum round-trip time
# Wrst  = maximum round-trip time (worst)
# StDev = standard deviation (high = inconsistent)
PRO TIP

When reading mtr output, look for where packet loss STARTS. If hop 3 shows 2% loss and hop 4 also shows 2%, the problem is at hop 3 (the loss propagates downstream). If hop 3 shows 2% loss but hop 4 shows 0%, hop 3 is just rate-limiting ICMP responses — not actually dropping traffic. Only the FINAL hop loss percentage matters for real packet loss. Intermediate hops often deprioritize ICMP and show false packet loss.
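That rule ("only final-hop loss is real") can be applied mechanically to mtr --report output. A sketch against a captured sample report; real mtr column positions can shift with hostname length, so live parsing may need adjustment:

```shell
# Extract final-hop loss from an mtr report (captured sample; columns: hop, host, Loss%, Snt)
mtr_report='HOST           Loss%  Snt
1. 10.244.1.1   0.0%   50
2. 10.0.1.1     2.0%   50
3. 10.244.2.8   0.0%   50'

final_loss=$(printf '%s\n' "$mtr_report" | awk 'END {gsub(/%/, "", $3); print $3}')
if awk -v l="$final_loss" 'BEGIN {exit !(l > 0)}'; then
  echo "real end-to-end loss: ${final_loss}%"
else
  echo "final hop clean: intermediate loss is ICMP rate-limiting, not real loss"
fi
# → final hop clean: intermediate loss is ICMP rate-limiting, not real loss
```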


Command 7: ip — Interface and Route Information

The ip command shows and manipulates network interfaces, routes, and ARP tables. It replaces ifconfig, route, and arp.

# Show all network interfaces and their IPs
ip addr show
# 1: lo: <LOOPBACK,UP,LOWER_UP>
#     inet 127.0.0.1/8 scope host lo
# 2: eth0@if18: <BROADCAST,MULTICAST,UP,LOWER_UP>
#     inet 10.244.1.5/32 scope global eth0
#          ^^^^^^^^^^^ This is the pod IP

# Show the routing table
ip route show
# default via 10.244.1.1 dev eth0
# 10.244.1.0/24 dev eth0 scope link
# 10.96.0.0/12 via 10.244.1.1 dev eth0
#
# This tells you:
# - Default gateway: 10.244.1.1 (traffic goes here if no specific route)
# - Local subnet: 10.244.1.0/24 (direct, no gateway)
# - Service CIDR: 10.96.0.0/12 (routed via gateway — because ClusterIPs
#   are virtual and handled by iptables on the node)

# Check route to a specific destination
ip route get 10.244.2.8
# 10.244.2.8 via 10.244.1.1 dev eth0 src 10.244.1.5
# → Traffic to 10.244.2.8 goes via gateway 10.244.1.1

# Show ARP/neighbor table (L2 adjacency)
ip neigh show
# 10.244.1.1 dev eth0 lladdr aa:bb:cc:dd:ee:ff REACHABLE
KEY CONCEPT

Inside a Kubernetes pod, ip route show reveals how the CNI configured networking. The default route goes to the node (via the pod gateway). The pod CIDR for the local node is a direct route. Service CIDRs are routed via the gateway because ClusterIPs are virtual — they are intercepted by iptables/IPVS on the node, not by actual routing. If ip route show inside a pod shows no default route, the CNI failed to set up networking for that pod.
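That last check can be scripted into a quick pod probe. The routing table below is a captured sample from a healthy pod; in a live pod, replace it with the output of ip route show:

```shell
# Sanity-check a pod routing table for a default route (sample; live: routes=$(ip route show))
routes='default via 10.244.1.1 dev eth0
10.244.1.0/24 dev eth0 scope link
10.96.0.0/12 via 10.244.1.1 dev eth0'

if printf '%s\n' "$routes" | grep -q '^default '; then
  echo "default route present: CNI networking looks configured"
else
  echo "no default route: CNI likely failed to configure this pod"
fi
# → default route present: CNI networking looks configured
```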


Bonus Commands

These are not in the "top 7" but are worth knowing:

# wget: Download files (available in more minimal images than curl)
wget -qO- http://api-service:8080/health
# -q: quiet, -O-: output to stdout

# telnet: Simple TCP connectivity test (if nc is not available)
telnet api-service 8080
# Trying 10.96.45.123...
# Connected to api-service.

# arp: Show ARP table (older, use ip neigh instead)
arp -n

# ethtool: Show NIC details (on nodes, not in pods)
ethtool eth0
# Speed: 25000Mb/s
# Link detected: yes

# iperf3: Network bandwidth testing
# Server: iperf3 -s
# Client: iperf3 -c server-ip
# Shows throughput between two endpoints

Putting It All Together — A Real Debug Session

Here is a complete debug session using all seven commands:

# Incident: "Frontend cannot reach the API service"

# Step 1: What is the symptom?
curl -v --connect-timeout 5 http://api-service.production:80/health
# * Trying 10.96.45.123:80...
# * Connection timed out
# Symptom: TIMEOUT → L3/L4 issue (not L7)

# Step 2: Does DNS resolve?
dig api-service.production.svc.cluster.local +short
# 10.96.45.123  ← DNS is fine

# Step 3: Can we reach the ClusterIP?
nc -zv -w 3 10.96.45.123 80
# Timeout  ← Cannot reach ClusterIP on port 80

# Step 4: Can we reach a pod directly?
kubectl get endpoints api-service -n production
# 10.244.2.8:8080,10.244.3.2:8080  ← Endpoints exist

nc -zv -w 3 10.244.2.8 8080
# Connection succeeded!  ← Pod is reachable directly

# Conclusion: Pod reachable, ClusterIP not reachable
# Issue is in kube-proxy / iptables layer

# Step 5: Check kube-proxy
kubectl get pods -n kube-system -l k8s-app=kube-proxy
# NAME                READY   STATUS             AGE
# kube-proxy-abc12    0/1     CrashLoopBackOff   5m   ← There it is!

# kube-proxy crashed → iptables rules not updated → ClusterIP not working
# Fix: investigate kube-proxy crash logs, restart
kubectl logs -n kube-system kube-proxy-abc12
WAR STORY

This exact scenario happened in production. kube-proxy crashed after a node kernel upgrade changed the iptables binary path. The existing iptables rules continued to work (they are in the kernel, not in the kube-proxy process), but new Services and Endpoint changes were not being programmed. Old Services worked fine, but any Service created or scaled after the crash was unreachable. It took 30 minutes to notice because "most things work" masked the partial failure.
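The session above follows a fixed order: DNS first, then TCP, then HTTP. That order can be packaged into a reusable triage helper. A sketch under stated assumptions: host and port are placeholders, getent stands in for dig so it works in minimal images, and bash's /dev/tcp stands in for nc:

```shell
# Layered triage: stop at the first failing layer (DNS -> TCP -> HTTP)
triage() {
  host=$1; port=$2
  getent hosts "$host" >/dev/null || { echo "FAIL at DNS: $host does not resolve"; return 1; }
  timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null \
    || { echo "FAIL at TCP: $host:$port unreachable (refused or filtered)"; return 2; }
  curl -fsS --max-time 5 "http://$host:$port/health" >/dev/null \
    || { echo "FAIL at HTTP: connected, but /health returned an error"; return 3; }
  echo "all layers OK: DNS, TCP, and HTTP all pass"
}
triage localhost 1 || true   # demo against a closed local port; return code names the failed layer
# → FAIL at TCP: localhost:1 unreachable (refused or filtered)
```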


Key Concepts Summary

  • curl -v is your primary HTTP debugging tool — use -w for timing analysis and --resolve to test specific backends
  • dig with +short for quick DNS lookups, +trace for resolution path debugging, @server for querying specific DNS servers
  • nc -zv answers "is this port open?" — distinguish between succeeded (open), refused (nothing listening), and timeout (firewalled)
  • ss -tlnp shows what is listening and on which address — 127.0.0.1 vs 0.0.0.0 is the most common misconfiguration
  • traceroute -T uses TCP instead of ICMP — more reliable in environments that block ICMP
  • mtr combines traceroute and ping for continuous monitoring — only the final hop loss matters, intermediate hops often show false loss
  • ip route show reveals how the CNI configured pod networking — no default route means CNI setup failed

Common Mistakes

  • Running dig hostname on a node and expecting Kubernetes DNS results — nodes use different resolvers than pods, use dig @10.96.0.10 hostname
  • Interpreting intermediate mtr hop loss as real packet loss — many routers deprioritize ICMP and show false loss
  • Using curl without -v — you miss headers, TLS details, and timing that are critical for diagnosis
  • Forgetting -n on traceroute — reverse DNS lookups add seconds of delay per hop, making the output painfully slow
  • Not checking ss -tlnp early in debugging — spending hours on network investigation when the app is listening on the wrong address or port
  • Using ping to test connectivity and concluding the network is broken when ICMP is simply blocked
  • Not setting timeouts on nc and curl — commands hang for minutes waiting for firewalled connections

KNOWLEDGE CHECK

You run ss -tlnp inside a pod and see the application listening on 127.0.0.1:8080. Other pods cannot connect to it on port 8080. What is the issue?