How DNS Resolution Works
Your Kubernetes Ingress returns 503 Service Unavailable. You check the pods — they are running. You check the Service — endpoints exist. You check the Ingress controller logs: "upstream connect error, no healthy upstream." But the Ingress is pointing to the correct service name.
On a hunch, you run `dig` from your laptop against the Ingress hostname: NXDOMAIN. The hostname does not resolve. The DNS record was deleted during a Terraform apply three hours ago, and nobody noticed because browser DNS caches kept the old answer alive. DNS was the problem the entire time. This is why you need to understand DNS from the ground up.
The DNS Hierarchy
DNS is a distributed, hierarchical database. No single server knows every hostname on the internet. Instead, the responsibility for answering DNS queries is delegated through a tree structure.
At the top of the tree is the root zone, represented by a dot (.). Below the root are Top-Level Domains (TLDs) like .com, .org, .io, .dev. Below those are second-level domains like devopsbeast.com. Below those can be any number of subdomains: api.devopsbeast.com, staging.api.devopsbeast.com.
Each level in the hierarchy is managed by different organizations:
- Root zone: managed by ICANN, served by the 13 named root servers (a.root-servers.net through m.root-servers.net), which are run by 12 operating organizations
- TLD (.com, .org): managed by registry operators (Verisign for .com, PIR for .org)
- Second-level domain (devopsbeast.com): managed by you, through your DNS provider (Cloudflare, Route53, Google Cloud DNS)
```shell
# You can see the hierarchy by querying with +trace
dig devopsbeast.com A +trace

# This shows:
# .                IN NS a.root-servers.net.   <-- Root
# com.             IN NS a.gtld-servers.net.   <-- TLD
# devopsbeast.com. IN NS ns1.cloudflare.com.   <-- Authoritative
# devopsbeast.com. IN A  104.21.45.67          <-- Answer
```
Every fully qualified domain name (FQDN) technically ends with a dot: devopsbeast.com. — the trailing dot represents the root zone. When you type devopsbeast.com in a browser, the dot is implied. When debugging DNS, always use the trailing dot in dig queries to be explicit and avoid search domain issues.
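The search-list behavior is easy to see with a sketch. This hypothetical helper (`candidates_for` is an illustrative name, and the logic deliberately ignores ndots handling) shows which names a stub resolver may actually try for a given input:

```shell
# Sketch of search-list expansion (illustrative only; real resolvers also
# apply an ndots threshold before consulting the search list).
candidates_for() {
  local name=$1; shift          # remaining args: configured search domains
  case "$name" in
    *.) echo "${name%.}" ;;     # trailing dot: exactly this FQDN, no search
    *)  for d in "$@"; do echo "$name.$d"; done
        echo "$name" ;;         # the literal name is tried as well
  esac
}

# candidates_for api corp.example internal
#   -> api.corp.example, api.internal, api
# candidates_for devopsbeast.com. corp.example
#   -> devopsbeast.com   (search list never consulted)
```

The trailing-dot branch is the whole point: an explicit FQDN short-circuits search-domain expansion, which is why dig queries with the trailing dot are unambiguous.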
The Resolution Flow: Step by Step
When your application needs to resolve devopsbeast.com, here is the complete chain of events:
The Players
Stub resolver: The DNS client on your machine (or pod). It does not resolve anything itself — it just asks a recursive resolver and waits for the answer.
Recursive resolver: The workhorse. This server does the actual work of walking the DNS hierarchy. Common recursive resolvers: your ISP's resolver, Google Public DNS (8.8.8.8), Cloudflare (1.1.1.1), or CoreDNS in Kubernetes.
Root nameserver: Knows which nameservers are authoritative for each TLD. Does not know any actual domain IPs.
TLD nameserver: Knows which nameservers are authoritative for each second-level domain under that TLD. Does not know actual IPs either.
Authoritative nameserver: The final authority for your domain. This server has the actual DNS records (A, AAAA, CNAME, MX, etc.) and returns definitive answers.
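You can replay the recursive resolver's job by hand with non-recursive queries. The `ask` helper below is a hypothetical wrapper around dig (the nameserver names come from the +trace example earlier); each level returns only a referral until the authoritative server finally answers:

```shell
# Walk the delegation chain manually. +norecurse asks each server to answer
# only from its own data, so you see exactly what each level knows.
ask() {
  # $1 = nameserver to ask, $2 = name to resolve
  dig @"$1" "$2" A +norecurse +noall +answer +authority
}

# ask a.root-servers.net devopsbeast.com   # referral to the com. TLD servers
# ask a.gtld-servers.net devopsbeast.com   # referral to ns1.cloudflare.com
# ask ns1.cloudflare.com devopsbeast.com   # the actual A record
```

This is the same walk dig +trace performs for you, but doing it one hop at a time makes it obvious which level a broken delegation lives at.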
DNS Resolution: The Complete Journey
In the worst case (cold cache), this process requires four network round-trips: stub to recursive, recursive to root, recursive to TLD, recursive to authoritative. In practice, recursive resolvers cache root and TLD nameserver addresses aggressively, so most queries only require one or two round-trips.
```shell
# Measure the full resolution time (cold cache)
dig devopsbeast.com A +stats | grep "Query time"
# ;; Query time: 47 msec

# Query again immediately (warm cache)
dig devopsbeast.com A +stats | grep "Query time"
# ;; Query time: 1 msec   <-- Served from cache
```
Use dig +trace to see each step of the resolution chain. This is invaluable when debugging DNS because it shows you exactly where the chain breaks. If the root and TLD respond but the authoritative server does not, you know the problem is at your DNS provider. If the TLD returns wrong NS records, your domain registration is misconfigured.
TTL: Time to Live
Every DNS record has a TTL (Time to Live) — a number in seconds that tells resolvers how long to cache the answer. When the TTL expires, the resolver must query the authoritative server again.
```shell
# Check the TTL of a record
dig devopsbeast.com A

# ;; ANSWER SECTION:
# devopsbeast.com.  300  IN  A  104.21.45.67
#                   ^^^
#                   TTL = 300 seconds (5 minutes)
```
TTL is a trade-off:
| TTL | Caching | Propagation | DNS Load |
|---|---|---|---|
| 60s (1 min) | Minimal — frequent queries | Fast — changes visible in 1 min | High — resolver queries often |
| 300s (5 min) | Good — reduces DNS load | Moderate — 5 min to propagate | Moderate |
| 3600s (1 hour) | Excellent — very few queries | Slow — 1 hour to propagate | Low |
| 86400s (24 hours) | Maximum — almost never queries | Very slow — 24 hours to propagate | Minimal |
Before making a DNS change (migrating to a new IP, changing providers), always lower the TTL first. If your TTL is 86400 (24 hours) and you change the A record, some users will still see the old IP for up to 24 hours. Lower the TTL to 60 seconds a day before the change, wait for the old TTL to expire, then make the change. After the change propagates, raise the TTL back.
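The waiting step can be automated. A minimal sketch (`wait_for_ttl_drop` is a hypothetical helper name) that polls a public resolver until the cached TTL has counted down below your new value:

```shell
# Poll a resolver until the cached TTL for a name falls below a threshold,
# i.e., until the old long-TTL answer has aged out of its cache.
wait_for_ttl_drop() {
  local name=$1 threshold=${2:-60} resolver=${3:-1.1.1.1}
  while :; do
    local ttl
    ttl=$(dig @"$resolver" "$name" A +noall +answer | awk '{print $2; exit}')
    echo "cached TTL for $name: ${ttl:-no answer}"
    [ -n "$ttl" ] && [ "$ttl" -le "$threshold" ] && break
    sleep 30
  done
}

# Usage: wait_for_ttl_drop devopsbeast.com 60
```

Note this only observes one resolver's cache; other resolvers around the world expire the record on their own clocks.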
The TTL Countdown
TTL is not static. It counts down from the moment the resolver caches the record. If you query a record with TTL 300 and then query again 100 seconds later, the resolver returns it with TTL 200 (the remaining time).
```shell
# Query and note the TTL
dig devopsbeast.com A +noall +answer
# devopsbeast.com. 300 IN A 104.21.45.67

# Wait 60 seconds, query again
dig devopsbeast.com A +noall +answer
# devopsbeast.com. 240 IN A 104.21.45.67
#                  ^^^ TTL decreased by 60
```
Caching Layers
DNS answers are cached at multiple levels. Understanding these layers is critical when debugging "why is my DNS change not taking effect?"
DNS Caching Layers — From Closest to Farthest
Browser cache: Chrome, Firefox, and Safari all maintain their own DNS caches. Chrome: check at chrome://net-internals/#dns. TTL: usually respects the record TTL, capped at a browser-specific maximum. Clear: close and reopen the browser.
OS cache: macOS (mDNSResponder), Windows (DNS Client service), Linux (systemd-resolved or nscd). Each caches DNS responses. Clear: sudo dscacheutil -flushcache (macOS), ipconfig /flushdns (Windows), resolvectl flush-caches — or systemd-resolve --flush-caches on older systems (Linux).
Local network cache: home routers, corporate DNS proxies, and VPN DNS servers often cache responses. These are outside your direct control. Clear: restart the router or wait for TTL expiry.
Recursive resolver cache: your configured resolver (1.1.1.1, 8.8.8.8, ISP resolver) caches responses according to TTL. This is the cache that matters most for propagation. Clear: wait for TTL expiry (some public resolvers, including Cloudflare and Google, offer web-based cache-purge tools).
Authoritative server: the source of truth. When all caches expire, this is where the resolver gets the definitive answer. If the record is wrong here, everything downstream is wrong.
We once changed a CNAME record and waited the full TTL for propagation. The new record worked for everyone except one team, who kept hitting the old IP. After an hour of debugging, we discovered their corporate VPN was running a DNS proxy that cached responses for 24 hours regardless of TTL. The fix was to flush the VPN DNS proxy cache manually. Lesson: you do not control all the caching layers between your users and your authoritative server.
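For the caches you do control, a best-effort flush helper might look like this (`flush_local_dns` is a hypothetical name; the commands assume current macOS and a systemd-based Linux):

```shell
# Flush the local OS-level DNS cache, dispatching on the current platform.
flush_local_dns() {
  case "$(uname -s)" in
    Darwin)
      sudo dscacheutil -flushcache
      sudo killall -HUP mDNSResponder ;;
    Linux)
      # Newer systemd uses resolvectl; fall back to the older command
      resolvectl flush-caches 2>/dev/null || sudo systemd-resolve --flush-caches ;;
    *)
      echo "unsupported OS: $(uname -s)" ;;
  esac
}
```

This only clears layer two of five: the browser, network, and recursive-resolver caches are untouched.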
Negative Caching: When NXDOMAIN Gets Stuck
When a domain does not exist, the authoritative server responds with NXDOMAIN (Non-Existent Domain). This response is also cached, according to the SOA record's minimum TTL (also called the negative TTL).
```shell
# Query a domain that does not exist
dig nonexistent.devopsbeast.com A

# ;; AUTHORITY SECTION:
# devopsbeast.com. 1800 IN SOA ns1.cloudflare.com. dns.cloudflare.com. ...
#                  ^^^^
#                  Negative cache TTL = 1800 seconds (30 minutes)
```
This means if you create a new DNS record for api.devopsbeast.com, anyone who queried it before it existed (and got NXDOMAIN) will continue getting NXDOMAIN for up to 30 minutes, even though the record now exists.
Negative caching is one of the most frustrating DNS behaviors. You create a record, test it, and it works. But users report NXDOMAIN. The cause: they queried the domain before the record existed, and their resolver cached the NXDOMAIN response. The fix: wait for the negative TTL to expire. Prevention: create DNS records before pointing traffic to them, not after.
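Before creating a record that might already be negatively cached, check how long the penalty lasts. A small sketch (`negative_ttl` is a hypothetical helper) that reads the SOA minimum field, which per RFC 2308 bounds the negative-cache TTL together with the SOA record's own TTL:

```shell
# Print the SOA "minimum" field for a zone — the value resolvers use to
# decide how long to cache an NXDOMAIN answer for names in that zone.
negative_ttl() {
  # The minimum is the last field of the SOA RDATA
  dig "$1" SOA +noall +answer | awk '{print $NF; exit}'
}

# negative_ttl devopsbeast.com   # e.g. 1800
```

If this prints a large number, budget that much lead time between creating a record and expecting it to resolve for anyone who queried it early.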
DNS Transport: UDP, TCP, DoH, DoT
DNS has evolved beyond simple UDP queries:
UDP (port 53): The original and most common transport. Fast because no handshake is required. Limited to 512 bytes per response (or 4096 with EDNS0). If the response is too large, the server sets the TC (truncated) flag and the client retries over TCP.
TCP (port 53): Used for large responses (zone transfers, DNSSEC-signed responses with many records). Also used when UDP is blocked. Adds one RTT for the TCP handshake.
DNS over HTTPS (DoH, port 443): DNS queries wrapped in HTTPS. Provides privacy (your ISP cannot see your DNS queries). Used by browsers (Firefox, Chrome). Adds TLS overhead but leverages existing HTTPS infrastructure.
DNS over TLS (DoT, port 853): DNS queries wrapped in TLS. Similar privacy benefits to DoH but uses a dedicated port. Used by Android, systemd-resolved.
```shell
# Standard UDP query
dig @1.1.1.1 devopsbeast.com A

# Force TCP
dig @1.1.1.1 devopsbeast.com A +tcp

# DNS over HTTPS (using curl)
curl -s -H "accept: application/dns-json" \
  "https://1.1.1.1/dns-query?name=devopsbeast.com&type=A"
# {"Status":0,"Answer":[{"name":"devopsbeast.com","type":1,"TTL":300,"data":"104.21.45.67"}]}
```
In Kubernetes clusters, DNS between pods and CoreDNS is typically plain UDP on port 53 (with TCP fallback for truncated responses). If you need encrypted DNS for external resolution, configure CoreDNS to forward to an upstream resolver over TLS using the forward plugin with the tls:// prefix. This encrypts DNS queries leaving the cluster while keeping internal DNS fast.
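A minimal Corefile sketch of that setup (assumptions: standard CoreDNS forward-plugin syntax and Cloudflare's DoT endpoint; swap in your own upstream and tls_servername):

```
.:53 {
    errors
    kubernetes cluster.local in-addr.arpa ip6.arpa
    # Forward non-cluster queries upstream over DNS-over-TLS (port 853)
    forward . tls://1.1.1.1 tls://1.0.0.1 {
        tls_servername cloudflare-dns.com
    }
    cache 30
}
```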
DNS Failures and What They Mean
When DNS goes wrong, the error messages are specific and meaningful:
| Response | Meaning | Common Cause |
|---|---|---|
| NOERROR + answer | Success | Everything works |
| NOERROR + empty | Domain exists but no records of that type | Querying AAAA when only an A record exists |
| NXDOMAIN | Domain does not exist | Typo, deleted record, wrong zone |
| SERVFAIL | Resolver could not reach authoritative server | Network issue, authoritative server down, DNSSEC failure |
| REFUSED | Resolver rejected the query | Resolver does not serve that zone, access control |
| Timeout | No response at all | Firewall blocking UDP 53, resolver unreachable, network down |
```shell
# Check the response status
dig devopsbeast.com A +noall +comments
# ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12345
#                                        ^^^^^^^
#                                        This tells you what happened

# SERVFAIL example
dig @10.96.0.10 failing-domain.com A +noall +comments
# ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 67890
```
SERVFAIL is the hardest DNS error to debug because it is a generic failure. It can mean: (1) the authoritative server is unreachable from the resolver, (2) DNSSEC validation failed, (3) the authoritative server returned an invalid response, or (4) the resolver itself is broken. Always check with multiple resolvers (dig @8.8.8.8, dig @1.1.1.1) to isolate whether the problem is your resolver or the authoritative server.
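That multi-resolver check is easy to script. A sketch (`check_resolvers` is a hypothetical helper) that extracts the status code from each resolver's response header:

```shell
# Query the same name against several resolvers and report each status,
# to localize whether a failure is one resolver or the authoritative side.
check_resolvers() {
  local name=$1; shift
  for r in "$@"; do
    local status
    status=$(dig @"$r" "$name" A +noall +comments +time=2 +tries=1 \
      | awk -F'status: ' '/status:/ {print $2}' | cut -d',' -f1)
    echo "$r -> ${status:-TIMEOUT}"
  done
}

# check_resolvers failing-domain.com 8.8.8.8 1.1.1.1 10.96.0.10
```

If the public resolvers return NOERROR while your in-cluster resolver returns SERVFAIL, the problem is local; if all of them fail, suspect the authoritative server.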
Key Concepts Summary
- DNS is a hierarchical, distributed database — root servers delegate to TLD servers, which delegate to authoritative servers
- Recursive resolvers do the heavy lifting — they walk the hierarchy and cache results for future queries
- TTL controls caching duration — lower TTL means faster propagation but more DNS load
- Negative caching (NXDOMAIN) causes the most confusion — a cached "does not exist" response persists even after the record is created
- DNS uses UDP port 53 by default — TCP is used for large responses, DoH/DoT for privacy
- Caching happens at five layers: browser, OS, local network, recursive resolver, and authoritative server — you only control the last one
- NXDOMAIN means the domain does not exist, SERVFAIL means the resolver could not reach the authoritative server, and timeout means something is blocking DNS entirely
- Always lower TTL before making DNS changes — drop to 60s, wait for old TTL to expire, then make the change
Common Mistakes
- Making a DNS change without lowering the TTL first — users see stale records for hours
- Forgetting about negative caching — querying a domain before creating the record poisons caches with NXDOMAIN
- Testing DNS changes only from your machine — your cache may be warm while everyone else still has the old answer
- Assuming SERVFAIL means the domain does not exist — it means the resolver could not complete the query, which is a different problem entirely
- Not checking the authoritative server directly when debugging — always verify with dig @ns1.your-provider.com domain.com A to see the source of truth
- Confusing the TTL in the answer section (remaining cache time) with the TTL at the authoritative server (original TTL)
You create a new A record for api.devopsbeast.com. A colleague tests it and gets NXDOMAIN, even though dig against the authoritative server returns the correct IP. What is the most likely cause?