DNS Debugging with dig, nslookup & tcpdump
It is 2 AM. PagerDuty fires. Your monitoring system reports that 30% of API requests are failing with "no such host" errors. Some pods can resolve DNS, others cannot. The failures are intermittent — a pod fails one query and succeeds the next. CoreDNS is running. The upstream resolver is healthy.
This is when you need more than nslookup. You need dig with its full arsenal of flags, tcpdump to capture what is actually on the wire, and a systematic approach to isolating where in the DNS chain the failure occurs.
dig: The Power Tool
dig (Domain Information Groper) is the most powerful DNS debugging tool. Unlike nslookup, dig gives you complete control over every aspect of the query and shows you the full response with all sections.
Basic Usage
# Simple A record lookup
dig devopsbeast.com A
# Output sections:
# ;; QUESTION SECTION: <-- What you asked
# ;devopsbeast.com. IN A
#
# ;; ANSWER SECTION: <-- The answer
# devopsbeast.com. 300 IN A 104.21.45.67
#
# ;; AUTHORITY SECTION: <-- Who is authoritative
# devopsbeast.com. 1800 IN NS ns1.cloudflare.com.
#
# ;; ADDITIONAL SECTION: <-- Extra info (glue records)
# ns1.cloudflare.com. 300 IN A 173.245.58.51
#
# ;; Query time: 12 msec
# ;; SERVER: 1.1.1.1#53
# ;; MSG SIZE rcvd: 123
Essential dig Flags
# Query a specific DNS server
dig @8.8.8.8 devopsbeast.com A
dig @10.96.0.10 my-service.default.svc.cluster.local A # CoreDNS
# Short output (just the answer)
dig devopsbeast.com A +short
# 104.21.45.67
# Show only the answer section
dig devopsbeast.com A +noall +answer
# devopsbeast.com. 300 IN A 104.21.45.67
# Trace the full resolution chain
dig devopsbeast.com A +trace
# Shows: root → TLD → authoritative → answer
# Check specific record types
dig devopsbeast.com MX +short
dig devopsbeast.com TXT +short
dig devopsbeast.com NS +short
dig devopsbeast.com SOA +short
dig devopsbeast.com ANY +short # All records (many servers block this)
# Reverse DNS lookup
dig -x 8.8.8.8 +short
# dns.google.
# Query with TCP instead of UDP
dig devopsbeast.com A +tcp
# Set a custom timeout (in seconds)
dig devopsbeast.com A +time=2 +tries=1
The +trace flag is your most powerful debugging tool. It shows every step of the resolution chain: root, TLD, authoritative. When DNS is broken, +trace tells you exactly where the chain breaks. If root and TLD respond but the authoritative server does not, the problem is at your DNS provider. If the TLD returns wrong NS records, your domain registration is misconfigured.
dig Inside Kubernetes Pods
Most minimal container images do not include dig. You have several options:
# Option 1: Use a debug container with dig installed
kubectl run debug --image=nicolaka/netshoot --rm -it -- bash
dig my-service.default.svc.cluster.local A
# Option 2: Use kubectl debug (ephemeral containers)
kubectl debug -it my-pod --image=nicolaka/netshoot -- dig google.com
# Option 3: Install dig in a running pod (if you have access)
# Note: wrap in sh -c, or everything after && runs locally, not in the pod
kubectl exec my-pod -- sh -c 'apt-get update && apt-get install -y dnsutils'
kubectl exec my-pod -- dig my-service.default.svc.cluster.local A
The nicolaka/netshoot image is the gold standard for network debugging in Kubernetes. It includes dig, nslookup, curl, tcpdump, ping, traceroute, netstat, ss, iperf, and dozens of other networking tools. Keep it bookmarked. When you need to debug networking issues, kubectl run debug --image=nicolaka/netshoot --rm -it -- bash is your starting command.
nslookup: Quick and Simple
nslookup is simpler than dig but is available in more container images. It is good for quick checks but lacks the detailed output dig provides.
# Basic lookup
nslookup devopsbeast.com
# Server: 1.1.1.1
# Address: 1.1.1.1#53
# Non-authoritative answer:
# Name: devopsbeast.com
# Address: 104.21.45.67
# Query a specific server
nslookup devopsbeast.com 8.8.8.8
# Query specific record type
nslookup -type=MX devopsbeast.com
nslookup -type=TXT devopsbeast.com
nslookup -type=SRV _http._tcp.my-headless.default.svc.cluster.local
# From inside a K8s pod
kubectl exec my-pod -- nslookup my-service.default.svc.cluster.local
kubectl exec my-pod -- nslookup google.com
dig vs nslookup: When to Use Which
- dig: the power tool for deep debugging
- nslookup: quick checks and basic troubleshooting
nslookup and dig may give different results because they use different resolvers. nslookup uses the system resolver (respects /etc/resolv.conf, including search domains and ndots). dig queries the DNS server directly by default and does NOT apply search domains unless you explicitly add +search. When debugging K8s DNS, always specify the full domain name to avoid confusion.
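One way to internalize the search-domain behavior is to simulate it. The sketch below is a plain-shell model of the glibc-style rule (a name with fewer dots than ndots gets search domains appended first); the function name and the example search list are illustrative, not a real resolver API:

```shell
#!/bin/sh
# expansion_order NAME NDOTS SEARCH_DOMAIN...
# Print the order in which a resolver would try candidate names.
expansion_order() {
  name=$1; ndots=$2; shift 2
  # A trailing dot marks the name fully qualified: no search expansion.
  case $name in
    *.) echo "${name%?}"; return ;;
  esac
  dots=$(printf '%s' "$name" | awk -F. '{ print NF - 1 }')
  if [ "$dots" -lt "$ndots" ]; then
    # Fewer dots than ndots: try search domains first, the bare name last.
    for d in "$@"; do echo "$name.$d"; done
    echo "$name"
  else
    # Enough dots: try the name as-is first.
    echo "$name"
    for d in "$@"; do echo "$name.$d"; done
  fi
}

# Typical pod settings in Kubernetes: ndots:5 and the cluster suffixes.
expansion_order google.com 5 \
  default.svc.cluster.local svc.cluster.local cluster.local
# google.com.default.svc.cluster.local
# google.com.svc.cluster.local
# google.com.cluster.local
# google.com
```

This is exactly why a pod resolving google.com can generate four queries, and why a trailing dot (google.com.) skips the churn entirely.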
tcpdump for DNS: Seeing What Is on the Wire
When dig and nslookup are not enough — when DNS works sometimes but not others, or when you suspect packets are being dropped — you need tcpdump to see the actual DNS packets on the network.
Capturing DNS Traffic
# Capture all DNS traffic on a node
sudo tcpdump -i any port 53 -nn
# Output:
# 10:00:00.001 IP 10.244.0.5.43210 > 10.96.0.10.53: 12345+ A? google.com.default.svc.cluster.local. (54)
# 10:00:00.002 IP 10.96.0.10.53 > 10.244.0.5.43210: 12345 NXDomain 0/1/0 (107)
# 10:00:00.003 IP 10.244.0.5.43210 > 10.96.0.10.53: 12346+ A? google.com.svc.cluster.local. (46)
# ...
# Decode:
# 10.244.0.5.43210 = Source pod IP, ephemeral port
# 10.96.0.10.53 = CoreDNS service IP, port 53
# 12345+ = DNS query ID, + means recursion desired
# A? = Query type (A record)
# NXDomain = Response: domain does not exist
Useful tcpdump Filters for DNS
# DNS traffic from a specific pod IP
sudo tcpdump -i any host 10.244.0.5 and port 53 -nn
# Only DNS queries (not responses): packets going to port 53
sudo tcpdump -i any dst port 53 -nn
# Only DNS responses
sudo tcpdump -i any src port 53 -nn
# Capture to a file for later analysis with Wireshark
sudo tcpdump -i any port 53 -w /tmp/dns-capture.pcap -nn
# Capture DNS traffic on a specific interface with verbose decoding
sudo tcpdump -i eth0 port 53 -vv -nn
Inside Kubernetes Pods
# Run tcpdump in a debug pod on the same node
kubectl debug node/my-node -it --image=nicolaka/netshoot -- \
tcpdump -i any port 53 -nn
# Or attach to the CoreDNS pod's network namespace with an ephemeral
# debug container (the coredns image itself has no shell or tcpdump)
kubectl debug -it -n kube-system coredns-abc123 \
  --image=nicolaka/netshoot --target=coredns -- \
  tcpdump -i any port 53 -nn -c 20
tcpdump shows you the ground truth. When dig says the query timed out, tcpdump tells you whether the query packet was actually sent, whether a response was received, or whether the packet was silently dropped. This is the difference between "DNS is broken" and "a firewall is dropping UDP packets on port 53."
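That claim can be made concrete. The helper below is a sketch (find_unanswered is an invented name, and it assumes the default tcpdump text format shown earlier): it pairs queries to responses by DNS query ID and prints queries that were never answered, the signature of a silent drop.

```shell
#!/bin/sh
# find_unanswered: read tcpdump's text output for `port 53 -nn` on stdin
# and print query IDs that never got a matching response.
find_unanswered() {
  awk '
    {
      # Locate the "IP" token so the format with or without an
      # interface name column (tcpdump -i any) both work.
      for (i = 1; i <= NF && $i != "IP"; i++) ;
      if (i > NF - 4) next                 # not a line we understand
      src = $(i+1); dst = $(i+3); id = $(i+4)
      sub(/[^0-9]+$/, "", id)              # strip "+" etc. from the ID
      if (dst ~ /\.53:$/)      sent[id] = $0   # packet to port 53 = query
      else if (src ~ /\.53$/)  delete sent[id] # from port 53 = response
    }
    END { for (id in sent) print "no response for query " id ": " sent[id] }
  '
}

# Example: query 12345 was answered, query 99999 was not.
printf '%s\n' \
  '10:00:00.001 IP 10.244.0.5.43210 > 10.96.0.10.53: 12345+ A? example.com. (40)' \
  '10:00:00.002 IP 10.96.0.10.53 > 10.244.0.5.43210: 12345 1/0/0 A 93.184.216.34 (56)' \
  '10:00:00.003 IP 10.244.0.5.43210 > 10.96.0.10.53: 99999+ A? example.org. (40)' \
  | find_unanswered
# only query 99999 is reported as unanswered
```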
The DNS Debugging Checklist
When a pod cannot resolve a hostname, follow this systematic approach:
DNS Debugging Decision Tree
1. Check the pod's /etc/resolv.conf: right nameserver, search domains, ndots?
2. Query CoreDNS directly: dig @10.96.0.10 <name>
3. Check CoreDNS health and logs: kubectl logs -n kube-system -l k8s-app=kube-dns
4. Test the upstream resolver from inside the CoreDNS pod
5. If still unclear, capture packets with tcpdump
Debugging Scenario 1: NXDOMAIN for a Service
# Step 1: Query from the pod
kubectl exec app-pod -- nslookup my-service.production.svc.cluster.local
# ** server cannot find my-service.production.svc.cluster.local: NXDOMAIN
# Step 2: Verify the service exists
kubectl get svc my-service -n production
# Error: services "my-service" not found
# CAUSE: The service does not exist in that namespace!
# Or:
kubectl get svc my-service -n production
# NAME TYPE CLUSTER-IP PORT(S)
# my-service ClusterIP 10.96.0.42 8080/TCP
kubectl get endpoints my-service -n production
# NAME ENDPOINTS
# my-service <none>
# CAUSE: Service exists but has no endpoints — no pods match the selector!
kubectl describe svc my-service -n production | grep Selector
# Selector: app=my-svc <-- Does any pod have this label?
kubectl get pods -n production -l app=my-svc
# No resources found. <-- No pods match!
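The checks in this scenario chain together naturally. Below is a hedged sketch of a wrapper, assuming kubectl access; the function name and its arguments are invented for illustration:

```shell
#!/bin/sh
# triage_service_dns SERVICE NAMESPACE
# Walk the NXDOMAIN checklist: does the service exist? does it have endpoints?
triage_service_dns() {
  svc=$1; ns=$2
  if ! kubectl get svc "$svc" -n "$ns" >/dev/null 2>&1; then
    echo "CAUSE: service $svc does not exist in namespace $ns"
    return 1
  fi
  # Collect the endpoint IPs; an empty result means no pods match the selector.
  eps=$(kubectl get endpoints "$svc" -n "$ns" \
        -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null)
  if [ -z "$eps" ]; then
    echo "CAUSE: service exists but has no endpoints - check the selector:"
    kubectl describe svc "$svc" -n "$ns" | grep Selector
    return 1
  fi
  echo "OK: $svc.$ns has endpoints: $eps"
}

# Usage:
# triage_service_dns my-service production
```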
Debugging Scenario 2: SERVFAIL for External Domains
# Step 1: Query external domain from the pod
kubectl exec debug -- dig google.com A
# ;; ->>HEADER<<- status: SERVFAIL
# Step 2: Query CoreDNS directly
kubectl exec debug -- dig @10.96.0.10 google.com A
# ;; ->>HEADER<<- status: SERVFAIL
# Step 3: Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=20
# [ERROR] plugin/forward: no nameservers found
# OR
# [ERROR] plugin/forward: unreachable backend
# Step 4: Check what CoreDNS is forwarding to
kubectl get configmap coredns -n kube-system -o yaml | grep forward
# forward . /etc/resolv.conf
# Step 5: Check the resolv.conf inside CoreDNS pod
kubectl exec -n kube-system coredns-abc123 -- cat /etc/resolv.conf
# nameserver 169.254.169.253 <-- VPC resolver (AWS)
# Step 6: Test upstream from CoreDNS pod
kubectl exec -n kube-system coredns-abc123 -- nslookup google.com 169.254.169.253
# ;; connection timed out
# CAUSE: CoreDNS cannot reach the upstream resolver!
# Check: NetworkPolicy blocking kube-system egress? VPC resolver down?
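The hop-by-hop pattern in this scenario can be scripted. This is a sketch, not a standard tool: probe() and first_broken_hop() are invented names, and the resolver IPs in the commented example are the ones from this scenario.

```shell
#!/bin/sh
# probe RESOLVER NAME: succeed if the resolver returns any answer at all.
probe() {
  dig @"$1" "$2" A +time=2 +tries=1 +short 2>/dev/null | grep -q .
}

# first_broken_hop NAME RESOLVER...: walk the chain and stop at the first
# resolver that fails to answer.
first_broken_hop() {
  name=$1; shift
  for r in "$@"; do
    if probe "$r" "$name"; then
      echo "ok:     $r"
    else
      echo "BROKEN: $r"
      return 1
    fi
  done
}

# Example chain: CoreDNS service IP, then the upstream it forwards to.
# first_broken_hop google.com 10.96.0.10 169.254.169.253
```

If CoreDNS answers but the upstream does not, the fault is outside the cluster; if CoreDNS itself fails, look at its logs and config first.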
Debugging Scenario 3: Intermittent DNS Timeouts
# Step 1: Check CoreDNS resource usage
kubectl top pods -n kube-system -l k8s-app=kube-dns
# NAME CPU MEMORY
# coredns-abc123 450m 96Mi
# coredns-def456 490m 98Mi
# Resource limits might be too low
# Step 2: Check CoreDNS metrics
# Metrics port 9153 is exposed by the CoreDNS pods (not always the Service)
kubectl port-forward -n kube-system deploy/coredns 9153:9153
curl -s localhost:9153/metrics | grep coredns_dns_request_duration_seconds
# Look for high latency percentiles
# Step 3: Check conntrack on nodes
ssh node-1 'cat /proc/sys/net/netfilter/nf_conntrack_count'
# 62000
ssh node-1 'cat /proc/sys/net/netfilter/nf_conntrack_max'
# 65536
# CAUSE: conntrack table almost full — DNS packets being dropped
# Step 4: Check for packet drops
ssh node-1 'netstat -s | grep -i drop'
# InErrors: 0
# NoPorts: 0
ssh node-1 'conntrack -S | grep drop'
# drop=1523 <-- Packets dropped due to conntrack full!
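A quick way to turn step 3's raw numbers into a verdict (conntrack_usage is an invented helper, and the 90% threshold is a rule of thumb, not an official limit):

```shell
#!/bin/sh
# conntrack_usage COUNT MAX: print table utilization and warn when it is
# close to full, which is when DNS packets start getting dropped.
conntrack_usage() {
  awk -v c="$1" -v m="$2" 'BEGIN {
    pct = 100 * c / m
    printf "conntrack: %d/%d (%.1f%%)\n", c, m, pct
    if (pct > 90) print "WARNING: near capacity - expect dropped DNS packets"
  }'
}

# On a node, feed it the live numbers:
# conntrack_usage "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
#                 "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
conntrack_usage 62000 65536
# conntrack: 62000/65536 (94.6%)
# WARNING: near capacity - expect dropped DNS packets
```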
The most insidious DNS bug I have encountered: a cluster where DNS worked for 99% of queries but failed for exactly one specific external domain. The domain had a DNSSEC-signed zone with an invalid signature. CoreDNS was performing DNSSEC validation (enabled by default in some configurations), and the validation failure returned SERVFAIL. Every other domain resolved fine. The fix: either disable DNSSEC validation in CoreDNS (not ideal) or contact the domain owner to fix their DNSSEC configuration.
Advanced dig Techniques
Querying Authoritative Servers Directly
When you suspect caching is the issue, bypass all caches and query the authoritative server directly:
# Find the authoritative nameservers
dig devopsbeast.com NS +short
# ns1.cloudflare.com.
# ns2.cloudflare.com.
# Query the authoritative server directly
dig @ns1.cloudflare.com devopsbeast.com A +short
# 104.21.45.67
# This answer is fresh from the source — no caching involved
Checking DNSSEC
# Check if a domain has DNSSEC
dig devopsbeast.com A +dnssec +short
# If RRSIG records appear, DNSSEC is enabled
# Validate DNSSEC chain
# Validate the chain (dig's +sigchase was removed in newer BIND; use delv)
delv devopsbeast.com A
# Check DS records at the parent zone
dig devopsbeast.com DS +short
# 12345 13 2 abc123...
Measuring DNS Performance
# Time a single query
dig devopsbeast.com A +stats | grep "Query time"
# ;; Query time: 3 msec
# Batch timing with multiple queries
for i in $(seq 1 100); do
dig devopsbeast.com A +stats +tries=1 +time=2 2>/dev/null | grep "Query time"
done | sort -t: -k2 -n | tail -5
# Shows the 5 slowest queries
# Test from inside a K8s pod
kubectl exec debug -- sh -c 'for i in $(seq 1 20); do dig my-service.default.svc.cluster.local A +tries=1 2>/dev/null | grep "Query time"; done'
When measuring DNS performance in Kubernetes, always test both cluster-internal names (my-service.default.svc.cluster.local) and external names (google.com) separately. Slow internal resolution points to CoreDNS issues. Slow external resolution points to upstream resolver issues or ndots overhead. Different root causes require different fixes.
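Building on the batch loop above, percentiles tell you more than the five slowest samples. A sketch using awk (dns_percentiles is an invented helper; the timings piped in below are fabricated):

```shell
#!/bin/sh
# dns_percentiles: read dig output (";; Query time: N msec" lines) on stdin
# and print p50/p95/p99 latency in milliseconds.
dns_percentiles() {
  grep "Query time" | awk '{ print $4 }' | sort -n | awk '
    function idx(p,  i) { i = int(NR * p); return i < 1 ? 1 : i }
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      printf "p50=%dms p95=%dms p99=%dms (n=%d)\n",
             v[idx(0.50)], v[idx(0.95)], v[idx(0.99)], NR
    }'
}

# Example with fabricated timings:
for t in 2 3 3 4 5 5 6 8 12 90; do
  echo ";; Query time: $t msec"
done | dns_percentiles
# p50=5ms p95=12ms p99=12ms (n=10)
```

Feed it the output of the batch loop from the previous block to get real numbers from your cluster.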
Building a DNS Debug Container
For teams that debug DNS frequently, create a purpose-built debug container:
FROM alpine:3.19
RUN apk add --no-cache \
bind-tools \
curl \
tcpdump \
netcat-openbsd \
busybox-extras \
jq
# bind-tools gives us dig and nslookup
# tcpdump for packet capture
# netcat for port testing
# jq for parsing DNS-over-HTTPS JSON responses
CMD ["sleep", "infinity"]
# Deploy it
kubectl apply -f - <<ENDOFFILE
apiVersion: v1
kind: Pod
metadata:
  name: dns-debug
  namespace: default
spec:
  containers:
  - name: debug
    image: your-registry/dns-debug:latest
    command: ["sleep", "infinity"]
ENDOFFILE
# Use it
kubectl exec -it dns-debug -- dig my-service.default.svc.cluster.local A
kubectl exec -it dns-debug -- tcpdump -i eth0 port 53 -nn -c 50
Key Concepts Summary
- dig is the primary DNS debugging tool — use +trace to see the full resolution chain, +short for concise output, @server to query specific resolvers
- nslookup is simpler but limited — good for quick checks, but lacks trace, custom timeouts, and detailed output
- tcpdump reveals ground truth — when dig says "timeout," tcpdump shows whether the packet was sent and whether a response came back
- Always debug systematically: pod resolv.conf, CoreDNS direct query, CoreDNS health, upstream resolver, packet capture
- The nicolaka/netshoot image has every network debugging tool you need — keep it bookmarked
- Query authoritative servers directly to bypass caching issues — dig @ns1.provider.com domain.com A
- NXDOMAIN usually means: typo, wrong namespace, or service does not exist — check with kubectl get svc and kubectl get endpoints
- SERVFAIL usually means: CoreDNS cannot reach upstream, DNSSEC validation failure, or the authoritative server is broken
- Timeouts usually mean: firewall blocking UDP 53, CoreDNS overloaded, or conntrack table full
Common Mistakes
- Using nslookup inside a pod and forgetting it applies search domains — a query for google.com might resolve as google.com.default.svc.cluster.local first
- Running dig without specifying a server (@10.96.0.10) and getting results from a different resolver than the pod uses
- Not checking the pod's /etc/resolv.conf — the pod might have dnsPolicy: None with no DNS configuration
- Forgetting that dig does not apply search domains by default — use the +search flag or provide the FQDN with a trailing dot
- Debugging DNS from outside the cluster and assuming results match in-cluster behavior — always debug from inside a pod
- Not checking CoreDNS logs — they often contain the exact error message explaining the failure
You run dig google.com from inside a Kubernetes pod and it returns SERVFAIL. You run dig @8.8.8.8 google.com from the same pod and it succeeds. What does this tell you?