Essential Network Commands
You SSH into a production Kubernetes node at 2 AM. An incident is in progress. The monitoring dashboard shows elevated 5xx errors. Developers say "the service is down." Your manager is in the incident channel asking for updates every 3 minutes.
You have no IDE, no fancy GUI tools, no time to read man pages. You need to diagnose the problem with command-line tools you have memorized. The seven commands in this lesson are the tools that senior engineers reach for instinctively. Learn them once, and you will use them for the rest of your career.
Command 1: curl — The Swiss Army Knife of HTTP
curl makes HTTP requests from the command line. It is the single most important debugging tool for any engineer working with web services.
The Essential Flags
# -v: Verbose output (shows headers, TLS handshake, timing)
# This is the flag you use 90% of the time
curl -v http://api-service:8080/health
# * Trying 10.96.45.123:8080...
# * Connected to api-service (10.96.45.123) port 8080
# > GET /health HTTP/1.1
# > Host: api-service:8080
# > User-Agent: curl/7.88.1
# > Accept: */*
# >
# < HTTP/1.1 200 OK
# < Content-Type: application/json
# < Content-Length: 20
# <
# {"status":"healthy"}
# -I: HEAD request (headers only, no body)
# Quick check: is the service responding? What status code?
curl -I http://api-service:8080/health
# HTTP/1.1 200 OK
# Content-Type: application/json
# Content-Length: 20
# -o /dev/null -s -w: Custom output format
# Extract specific information (status code, timing)
curl -o /dev/null -s -w "HTTP %{http_code} | Total: %{time_total}s | Connect: %{time_connect}s | TTFB: %{time_starttransfer}s\n" \
http://api-service:8080/health
# HTTP 200 | Total: 0.045s | Connect: 0.002s | TTFB: 0.043s
The -w (write-out) flag is incredibly powerful for diagnosing latency issues. time_connect shows TCP handshake time (network latency). time_starttransfer shows Time To First Byte (TTFB — includes server processing time). If time_connect is high, the network is slow. If time_starttransfer is high but time_connect is low, the server is slow. This single command answers "is it the network or the app?"
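This network-or-app decision can be scripted around curl's timing variables. A minimal bash sketch, assuming a hypothetical 100 ms threshold for each phase; `classify_latency` and the threshold are illustrative, and the endpoint URL is a placeholder:

```shell
# Classify latency from curl's timing variables.
# connect = TCP handshake time, ttfb = time to first byte (both in seconds).
classify_latency() {
  local connect="$1" ttfb="$2"
  # awk handles the floating-point comparison; 0.100s is an assumed threshold
  if awk -v c="$connect" 'BEGIN { exit !(c > 0.100) }'; then
    echo "network-slow"    # the handshake itself is slow
  elif awk -v t="$ttfb" 'BEGIN { exit !(t > 0.100) }'; then
    echo "server-slow"     # handshake fast, server took long to respond
  else
    echo "ok"
  fi
}

# Feed it real numbers from curl (URL is a placeholder for your service):
# read connect ttfb < <(curl -o /dev/null -s \
#     -w "%{time_connect} %{time_starttransfer}" http://api-service:8080/health)
# classify_latency "$connect" "$ttfb"

classify_latency 0.002 0.043   # prints: ok
classify_latency 0.002 0.950   # prints: server-slow
classify_latency 0.450 0.900   # prints: network-slow
```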
Practical Scenarios
# Scenario 1: Test an endpoint with a specific Host header
# (Useful when DNS does not resolve but you know the IP)
curl -v -H "Host: api.example.com" http://10.96.45.123:80/health
# Scenario 2: Skip TLS verification (self-signed certs in dev/staging)
curl -vk https://api-service:8443/health
# Scenario 3: Set a connection timeout (do not hang forever)
curl -v --connect-timeout 5 --max-time 10 http://api-service:8080/health
# --connect-timeout: max seconds for TCP handshake
# --max-time: max seconds for the entire request
# Scenario 4: Force resolution to a specific IP
# (Test a specific pod without changing DNS)
curl -v --resolve api.example.com:443:10.244.1.5 https://api.example.com/health
# Scenario 5: Send a POST request with JSON body
curl -v -X POST -H "Content-Type: application/json" \
-d '{"key": "value"}' \
http://api-service:8080/api/data
# Scenario 6: Follow redirects
curl -vL http://api-service:8080/old-path
# -L follows 301/302 redirects automatically
Use --resolve instead of editing /etc/hosts when you need to test a specific backend IP with a hostname. It works for a single curl request without affecting the rest of the system. This is perfect for testing whether a specific pod is healthy without going through the load balancer.
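One way to apply --resolve across every backend is to generate one probe command per pod. A dry-run sketch that prints the commands instead of running them; the hostname, port, and pod IPs are hypothetical examples:

```shell
# Dry-run sketch: generate one curl --resolve command per backend pod,
# so each pod can be tested through the same hostname (and TLS SNI).
build_probe() {
  local host="$1" port="$2" ip="$3"
  printf 'curl -sv --resolve %s:%s:%s https://%s:%s/health\n' \
    "$host" "$port" "$ip" "$host" "$port"
}

for ip in 10.244.1.5 10.244.2.8 10.244.3.2; do
  build_probe api.example.com 443 "$ip"
done
```

Pipe the output to `bash` (or call curl directly instead of printf) to actually probe each pod.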
Command 2: dig — DNS Query Tool
dig (Domain Information Groper) queries DNS servers directly. It shows you exactly what DNS returns, including TTL, record type, and which server answered.
The Essential Patterns
# Basic query: what IP does this hostname resolve to?
dig api-service.production.svc.cluster.local
# ;; ANSWER SECTION:
# api-service.production.svc.cluster.local. 30 IN A 10.96.45.123
# +short: just the answer, no noise
dig api-service.production.svc.cluster.local +short
# 10.96.45.123
# Query a specific DNS server (CoreDNS in K8s)
dig @10.96.0.10 api-service.production.svc.cluster.local +short
# 10.96.45.123
# (@10.96.0.10 is typically the CoreDNS ClusterIP)
# Query specific record types
dig api.example.com A # IPv4 address
dig api.example.com AAAA # IPv6 address
dig api.example.com CNAME # Canonical name (alias)
dig api.example.com MX # Mail exchange
dig api.example.com TXT # Text records (SPF, DKIM, etc.)
dig api.example.com NS # Name servers
dig api.example.com SRV # Service records (used by K8s headless Services)
DNS Debugging in Kubernetes
# Check if CoreDNS is resolving cluster-internal names
dig @10.96.0.10 kubernetes.default.svc.cluster.local +short
# 10.96.0.1 (the API server ClusterIP — always exists)
# Check a headless Service (returns pod IPs, not ClusterIP)
dig @10.96.0.10 database.production.svc.cluster.local +short
# 10.244.1.5
# 10.244.2.8
# 10.244.3.2
# Check SRV records for a headless Service (includes the port)
# SRV names follow _port-name._protocol.service.namespace.svc.cluster.local
# (here "_postgres" stands in for the Service's named port)
dig @10.96.0.10 _postgres._tcp.database.production.svc.cluster.local SRV +short
# 0 33 5432 database-0.database.production.svc.cluster.local.
# 0 33 5432 database-1.database.production.svc.cluster.local.
# 0 33 5432 database-2.database.production.svc.cluster.local.
# Check external DNS resolution (goes through CoreDNS upstream)
dig @10.96.0.10 google.com +short
# 142.250.80.46
# If this fails: CoreDNS cannot reach upstream DNS servers
When debugging DNS in Kubernetes, always use dig @10.96.0.10 (or whatever your CoreDNS ClusterIP is) to explicitly query the cluster DNS server. If you just run dig hostname, it uses /etc/resolv.conf, which inside a pod points to CoreDNS, but on a node might point to the cloud DNS (169.254.169.253 on AWS). This distinction matters when the problem is specific to cluster DNS.
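To see which resolver a bare dig will consult by default, read /etc/resolv.conf. A small sketch against a sample in-pod resolver file; the sample values are illustrative:

```shell
# Which server will a bare "dig hostname" actually ask? The first
# "nameserver" line in resolv.conf answers that.
first_nameserver() {
  awk '/^nameserver/ { print $2; exit }' "$1"
}

# Sample file mimicking what a pod typically gets from kubelet:
cat > /tmp/resolv.sample <<'EOF'
search production.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.0.10
options ndots:5
EOF

first_nameserver /tmp/resolv.sample   # prints: 10.96.0.10
# On a real system: first_nameserver /etc/resolv.conf
```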
The +trace Flag — Follow the Full Resolution Path
# Trace the full DNS resolution path from root servers down
dig +trace api.example.com
# . 518400 IN NS a.root-servers.net.
# com. 172800 IN NS a.gtld-servers.net.
# example.com. 172800 IN NS ns1.example.com.
# api.example.com. 300 IN A 203.0.113.50
#
# This shows every step of DNS resolution:
# Root → .com TLD → example.com authoritative → final answer
# If resolution fails, +trace shows exactly WHERE it fails
Command 3: nslookup — Simpler DNS Lookup
nslookup is a simpler alternative to dig. It does not show as much detail but is available on more systems (including many minimal container images).
# Basic lookup
nslookup api-service.production.svc.cluster.local
# Server: 10.96.0.10
# Address: 10.96.0.10#53
#
# Name: api-service.production.svc.cluster.local
# Address: 10.96.45.123
# Query a specific DNS server
nslookup api-service.production.svc.cluster.local 10.96.0.10
# Reverse lookup (IP to hostname)
nslookup 10.96.45.123
# 123.45.96.10.in-addr.arpa name = api-service.production.svc.cluster.local
Use dig when you need detailed DNS debugging (TTL, authoritative server, trace). Use nslookup when you just need a quick "does this hostname resolve?" check and dig is not installed. In Kubernetes debug pods (netshoot), both are available — prefer dig for its richer output.
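The reverse-lookup name that nslookup builds behind the scenes is mechanical: reverse the four octets and append in-addr.arpa. A sketch of that transformation; `ptr_name` is a hypothetical helper:

```shell
# Build the PTR query name for an IPv4 address.
ptr_name() {
  local IFS=.
  # Word-splitting on dots is intentional here:
  set -- $1
  echo "$4.$3.$2.$1.in-addr.arpa"
}

ptr_name 10.96.45.123   # prints: 123.45.96.10.in-addr.arpa
# This is the name that "dig -x 10.96.45.123" queries under the hood.
```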
Command 4: nc (netcat) — TCP/UDP Connectivity Tester
Netcat is the simplest way to test whether you can establish a TCP connection to a specific host and port. It answers the question: "Is the port open?"
# Test TCP connectivity (-z: scan mode, -v: verbose)
nc -zv api-service 8080
# Connection to api-service 8080 port [tcp/*] succeeded!
# → Port is open, something is listening
nc -zv api-service 8080
# nc: connect to api-service port 8080 (tcp) failed: Connection refused
# → Port is NOT open (nothing listening, or wrong port)
nc -zv api-service 8080
# nc: connect to api-service port 8080 (tcp) failed: Operation timed out
# → Port is firewalled (packets dropped, no response)
# Test with a timeout (do not hang forever)
nc -zv -w 3 api-service 8080
# -w 3: timeout after 3 seconds
# Scan a range of ports
nc -zv api-service 8080-8090
# Connection to api-service 8080 port [tcp/*] succeeded!
# Connection to api-service 8081 port [tcp/*] succeeded!
# nc: connect to api-service port 8082 (tcp) failed: Connection refused
# ...
# Test UDP connectivity (less reliable — UDP has no handshake)
nc -zuv api-service 53
# Connection to api-service 53 port [udp/domain] succeeded!
The three results from nc -zv each mean something different. succeeded = port open, app is listening. Connection refused = host reachable but nothing on that port (RST packet received). Operation timed out = packets are being dropped (firewall, NetworkPolicy). Learn to distinguish these — they point to completely different root causes.
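Those three outcomes can be mapped mechanically. A sketch; `diagnose_nc` is a hypothetical helper, and the message strings match common netcat variants, which differ slightly between distributions:

```shell
# Map the three possible nc -zv outcomes to a likely root cause.
diagnose_nc() {
  case "$1" in
    *succeeded*)            echo "open: something is listening" ;;
    *"Connection refused"*) echo "closed: host reachable, nothing on that port" ;;
    *"timed out"*)          echo "filtered: packets dropped (firewall/NetworkPolicy)" ;;
    *)                      echo "unknown: inspect the raw output" ;;
  esac
}

# In practice, capture nc's message (it writes to stderr):
# diagnose_nc "$(nc -zv -w 3 api-service 8080 2>&1)"
diagnose_nc "Connection to api-service 8080 port [tcp/*] succeeded!"
diagnose_nc "nc: connect to api-service port 8080 (tcp) failed: Connection refused"
```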
netcat as a Lightweight Server
# Start a listener on port 9090 (useful for testing connectivity)
nc -l -p 9090
# (some netcat variants omit -p when listening: nc -l 9090)
# On another pod, test sending data:
echo "hello" | nc -w 1 debug-pod 9090
# If "hello" appears on the listener, end-to-end connectivity works
# This is useful for testing NetworkPolicies:
# 1. Start a listener in the destination pod
# 2. Send data from the source pod
# 3. If data arrives, the NetworkPolicy allows the traffic
Command 5: ss — Socket Statistics
ss shows network connections, listening ports, and socket statistics. It replaces the older netstat command and is faster on systems with many connections.
# Show all listening TCP ports with process names
ss -tlnp
# State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
# LISTEN 0 128 0.0.0.0:8080 0.0.0.0:* users:(("java",pid=1,fd=5))
# LISTEN 0 128 127.0.0.1:9090 0.0.0.0:* users:(("java",pid=1,fd=8))
#
# Flags: -t (TCP), -l (listening), -n (numeric), -p (process)
# CRITICAL INSIGHT from the output above:
# Port 8080 listens on 0.0.0.0 → accessible from ANY IP (correct)
# Port 9090 listens on 127.0.0.1 → accessible ONLY from localhost (problem if external access needed)
When ss -tlnp shows a service listening on 127.0.0.1:PORT, it is only reachable from within the same pod/container. External connections (from other pods, from the Service) will get "Connection refused." This is the #1 silent killer in container networking. The fix is in the application config — change the bind address to 0.0.0.0.
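A quick way to catch loopback-only listeners is to filter the ss output. A sketch against a captured sample; the field positions assume the ss -tlnp layout shown above:

```shell
# Sample mirroring "ss -tlnp" output (on a live system, capture ss instead):
cat > /tmp/ss.sample <<'EOF'
State   Recv-Q Send-Q Local Address:Port Peer Address:Port Process
LISTEN  0      128    0.0.0.0:8080       0.0.0.0:*         users:(("java",pid=1,fd=5))
LISTEN  0      128    127.0.0.1:9090     0.0.0.0:*         users:(("java",pid=1,fd=8))
EOF

# Print ports bound to 127.0.0.1 only, i.e. unreachable from other pods.
loopback_only() {
  awk '$1 == "LISTEN" && $4 ~ /^127\.0\.0\.1:/ { split($4, a, ":"); print a[2] }' "$1"
}

loopback_only /tmp/ss.sample   # prints: 9090
# Live usage: ss -tlnp > /tmp/ss.out && loopback_only /tmp/ss.out
```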
Practical ss Patterns
# Show established connections to a specific port
ss -tnp state established '( dport = :5432 )'
# Shows all connections to PostgreSQL — useful for seeing connection count
# Count connections by state
ss -t state established | wc -l
# 245 ← 245 established TCP connections
ss -t state time-wait | wc -l
# 1203 ← 1203 connections in TIME_WAIT (high number = connection churn)
# Show all connections to a specific remote IP
ss -tnp dst 10.244.2.8
# Shows all connections from this host to a specific pod
# Show socket memory usage (debug buffer issues)
ss -tm
# Shows send/receive buffer sizes per connection
# Show summary statistics
ss -s
# Total: 534 (kernel 612)
# TCP: 402 (estab 245, closed 89, orphaned 2, synrecv 0, timewait 68/0)
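The established vs TIME_WAIT ratio can be turned into a rough churn check. A sketch; the 3x ratio is an assumed rule of thumb, not a standard limit:

```shell
# Rough connection-churn indicator: a TIME_WAIT count several times the
# ESTABLISHED count suggests short-lived connections (no keep-alive / pooling).
churn_check() {
  local estab="$1" timewait="$2"
  if [ "$timewait" -gt $((estab * 3)) ]; then
    echo "high-churn"
  else
    echo "normal"
  fi
}

# Live counts (each ss listing includes a header line, hence tail):
# estab=$(ss -t state established | tail -n +2 | wc -l)
# tw=$(ss -t state time-wait | tail -n +2 | wc -l)
churn_check 245 1203   # prints: high-churn
churn_check 245 68     # prints: normal
```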
When to Use Which Command
For quick connectivity checks ("is the target reachable?"), reach for nc -zv, dig +short, and curl -I. For deep investigation ("what exactly is happening?"), use curl -v, dig +trace, ss -tlnp, and mtr.
Command 6: traceroute & mtr — Path Tracing
traceroute shows the network path (hops) between your machine and the destination. Each hop is a router that forwards your packet.
# Basic traceroute
traceroute -n 10.244.2.8
# traceroute to 10.244.2.8 (10.244.2.8), 30 hops max, 60 byte packets
# 1 10.244.1.1 0.5 ms 0.4 ms 0.5 ms # Pod gateway
# 2 10.0.1.1 1.2 ms 1.1 ms 1.3 ms # Node network
# 3 10.0.2.1 1.8 ms 1.7 ms 1.9 ms # Destination node
# 4 10.244.2.8 2.1 ms 2.0 ms 2.2 ms # Destination pod
#
# Flags: -n (numeric — do not resolve hostnames, much faster)
# TCP traceroute (bypasses ICMP firewalls)
traceroute -T -p 8080 10.244.2.8
# Uses TCP SYN instead of ICMP — more likely to succeed in
# environments that block ICMP but allow TCP
# When hops show * * *:
traceroute -n 10.244.2.8
# 1 10.244.1.1 0.5 ms 0.4 ms 0.5 ms
# 2 10.0.1.1 1.2 ms 1.1 ms 1.3 ms
# 3 * * * # ← This hop drops/blocks
# 4 * * * # traceroute packets
# This does NOT necessarily mean the network is broken.
# Many routers drop ICMP/UDP traceroute packets by policy.
# Try -T (TCP) mode, or test with nc -zv instead.
mtr — Continuous traceroute
mtr combines traceroute and ping into a continuous monitoring tool. It sends packets repeatedly and shows real-time statistics per hop.
# Run mtr (interactive mode)
mtr -n 10.244.2.8
# HOST Loss% Snt Last Avg Best Wrst StDev
# 1. 10.244.1.1 0.0% 50 0.5 0.5 0.3 1.2 0.1
# 2. 10.0.1.1 0.0% 50 1.2 1.1 0.8 2.1 0.2
# 3. 10.0.2.1 2.0% 50 1.8 1.7 1.4 3.5 0.4 ← 2% loss
# 4. 10.244.2.8 2.0% 50 2.1 2.0 1.7 3.8 0.3
# Report mode (run 100 packets and print summary)
mtr -n --report -c 100 10.244.2.8
# The columns:
# Loss% = packet loss at this hop
# Snt = packets sent
# Last = last round-trip time
# Avg = average round-trip time
# Best = minimum round-trip time
# Wrst = maximum round-trip time (worst)
# StDev = standard deviation (high = inconsistent)
When reading mtr output, look for where packet loss STARTS. If hop 3 shows 2% loss and hop 4 also shows 2%, the problem is at hop 3 (the loss propagates downstream). If hop 3 shows 2% loss but hop 4 shows 0%, hop 3 is just rate-limiting ICMP responses — not actually dropping traffic. Only the FINAL hop loss percentage matters for real packet loss. Intermediate hops often deprioritize ICMP and show false packet loss.
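The "only final-hop loss is real" rule can be expressed directly. A sketch; `interpret_loss` is a hypothetical helper that takes per-hop loss percentages (integers, in hop order):

```shell
# Decide whether mtr loss figures indicate real end-to-end packet loss.
interpret_loss() {
  local hops=("$@")
  local final="${hops[${#hops[@]}-1]}"   # loss at the destination hop
  if [ "$final" -gt 0 ]; then
    echo "real loss: ${final}% end to end"
  else
    echo "no real loss (intermediate loss is likely ICMP deprioritization)"
  fi
}

interpret_loss 0 0 2 2   # prints: real loss: 2% end to end
interpret_loss 0 0 2 0   # prints: no real loss (intermediate loss is likely ICMP deprioritization)
```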
Command 7: ip — Interface and Route Information
The ip command shows and manipulates network interfaces, routes, and ARP tables. It replaces ifconfig, route, and arp.
# Show all network interfaces and their IPs
ip addr show
# 1: lo: <LOOPBACK,UP,LOWER_UP>
# inet 127.0.0.1/8 scope host lo
# 2: eth0@if18: <BROADCAST,MULTICAST,UP,LOWER_UP>
# inet 10.244.1.5/32 scope global eth0
# ^^^^^^^^^^^ This is the pod IP
# Show the routing table
ip route show
# default via 10.244.1.1 dev eth0
# 10.244.1.0/24 dev eth0 scope link
# 10.96.0.0/12 via 10.244.1.1 dev eth0
#
# This tells you:
# - Default gateway: 10.244.1.1 (traffic goes here if no specific route)
# - Local subnet: 10.244.1.0/24 (direct, no gateway)
# - Service CIDR: 10.96.0.0/12 (routed via gateway — because ClusterIPs
# are virtual and handled by iptables on the node)
# Check route to a specific destination
ip route get 10.244.2.8
# 10.244.2.8 via 10.244.1.1 dev eth0 src 10.244.1.5
# → Traffic to 10.244.2.8 goes via gateway 10.244.1.1
# Show ARP/neighbor table (L2 adjacency)
ip neigh show
# 10.244.1.1 dev eth0 lladdr aa:bb:cc:dd:ee:ff REACHABLE
Inside a Kubernetes pod, ip route show reveals how the CNI configured networking. The default route goes to the node (via the pod gateway). The pod CIDR for the local node is a direct route. Service CIDRs are routed via the gateway because ClusterIPs are virtual — they are intercepted by iptables/IPVS on the node, not by actual routing. If ip route show inside a pod shows no default route, the CNI failed to set up networking for that pod.
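The "no default route means CNI failure" check is easy to script. A sketch against a sample routing table mirroring the output above; `has_default_route` is a hypothetical helper:

```shell
# Sample mirroring "ip route show" inside a healthy pod:
cat > /tmp/routes.sample <<'EOF'
default via 10.244.1.1 dev eth0
10.244.1.0/24 dev eth0 scope link
10.96.0.0/12 via 10.244.1.1 dev eth0
EOF

# CNI sanity check: is there a default route at all?
has_default_route() {
  if grep -q '^default ' "$1"; then
    echo "default route present"
  else
    echo "NO default route: CNI setup likely failed"
  fi
}

has_default_route /tmp/routes.sample   # prints: default route present
# Live usage: ip route show > /tmp/routes.out && has_default_route /tmp/routes.out
```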
Bonus Commands
These are not in the "top 7" but are worth knowing:
# wget: Download files (available in more minimal images than curl)
wget -qO- http://api-service:8080/health
# -q: quiet, -O-: output to stdout
# telnet: Simple TCP connectivity test (if nc is not available)
telnet api-service 8080
# Trying 10.96.45.123...
# Connected to api-service.
# arp: Show ARP table (older, use ip neigh instead)
arp -n
# ethtool: Show NIC details (on nodes, not in pods)
ethtool eth0
# Speed: 25000Mb/s
# Link detected: yes
# iperf3: Network bandwidth testing
# Server: iperf3 -s
# Client: iperf3 -c server-ip
# Shows throughput between two endpoints
Putting It All Together — A Real Debug Session
Here is a complete debug session using all seven commands:
# Incident: "Frontend cannot reach the API service"
# Step 1: What is the symptom?
curl -v --connect-timeout 5 http://api-service.production:80/health
# * Trying 10.96.45.123:80...
# * Connection timed out
# Symptom: TIMEOUT → L3/L4 issue (not L7)
# Step 2: Does DNS resolve?
dig api-service.production.svc.cluster.local +short
# 10.96.45.123 ← DNS is fine
# Step 3: Can we reach the ClusterIP?
nc -zv -w 3 10.96.45.123 80
# Timeout ← Cannot reach ClusterIP on port 80
# Step 4: Can we reach a pod directly?
kubectl get endpoints api-service -n production
# 10.244.2.8:8080,10.244.3.2:8080 ← Endpoints exist
nc -zv -w 3 10.244.2.8 8080
# Connection succeeded! ← Pod is reachable directly
# Conclusion: Pod reachable, ClusterIP not reachable
# Issue is in kube-proxy / iptables layer
# Step 5: Check kube-proxy
kubectl get pods -n kube-system -l k8s-app=kube-proxy
# NAME READY STATUS AGE
# kube-proxy-abc12 0/1 CrashLoopBackOff 5m ← There it is!
# kube-proxy crashed → iptables rules not updated → ClusterIP not working
# Fix: investigate kube-proxy crash logs, restart
kubectl logs -n kube-system kube-proxy-abc12
This exact scenario happened in production. kube-proxy crashed after a node kernel upgrade changed the iptables binary path. The existing iptables rules continued to work (they are in the kernel, not in the kube-proxy process), but new Services and Endpoint changes were not being programmed. Old Services worked fine, but any Service created or scaled after the crash was unreachable. It took 30 minutes to notice because "most things work" masked the partial failure.
Key Concepts Summary
- curl -v is your primary HTTP debugging tool — use -w for timing analysis and --resolve to test specific backends
- dig with +short for quick DNS lookups, +trace for resolution path debugging, @server for querying specific DNS servers
- nc -zv answers "is this port open?" — distinguish between succeeded (open), refused (nothing listening), and timeout (firewalled)
- ss -tlnp shows what is listening and on which address — 127.0.0.1 vs 0.0.0.0 is the most common misconfiguration
- traceroute -T uses TCP instead of ICMP — more reliable in environments that block ICMP
- mtr combines traceroute and ping for continuous monitoring — only the final hop loss matters, intermediate hops often show false loss
- ip route show reveals how the CNI configured pod networking — no default route means CNI setup failed
Common Mistakes
- Running dig hostname on a node and expecting Kubernetes DNS results — nodes use different resolvers than pods; use dig @10.96.0.10 hostname
- Interpreting intermediate mtr hop loss as real packet loss — many routers deprioritize ICMP and show false loss
- Using curl without -v — you miss headers, TLS details, and timing that are critical for diagnosis
- Forgetting -n on traceroute — reverse DNS lookups add seconds of delay per hop, making the output painfully slow
- Not checking ss -tlnp early in debugging — spending hours on network investigation when the app is listening on the wrong address or port
- Using ping to test connectivity and concluding the network is broken when ICMP is simply blocked
- Not setting timeouts on nc and curl — commands hang for minutes waiting for firewalled connections
You run ss -tlnp inside a pod and see the application listening on 127.0.0.1:8080. Other pods cannot connect to it on port 8080. What is the issue?