How DNS Resolution Works
Your Kubernetes Ingress returns 503 Service Unavailable. You check the pods — they are running. You check the Service — endpoints exist. You check the Ingress controller logs: "upstream connect error, no healthy upstream." But the Ingress is pointing to the correct service name.
On a hunch, you run `dig` from your laptop against the Ingress hostname: NXDOMAIN. The hostname does not resolve. The DNS record was deleted during a Terraform apply three hours ago, and nobody noticed because browser DNS caches kept the old answer alive. DNS was the problem the entire time. This is why you need to understand DNS from the ground up.
The DNS Hierarchy
DNS is a distributed, hierarchical database. No single server knows every hostname on the internet. Instead, the responsibility for answering DNS queries is delegated through a tree structure.
At the top of the tree is the root zone, represented by a dot (.). Below the root are Top-Level Domains (TLDs) like .com, .org, .io, .dev. Below those are second-level domains like devopsbeast.com. Below those can be any number of subdomains: api.devopsbeast.com, staging.api.devopsbeast.com.
Each level in the hierarchy is managed by different organizations:
- Root zone: managed by ICANN, served by the 13 named root servers (a.root-servers.net through m.root-servers.net), which are run by 12 operating organizations
- TLD (.com, .org): managed by registry operators (Verisign for .com, PIR for .org)
- Second-level domain (devopsbeast.com): managed by you, through your DNS provider (Cloudflare, Route53, Google Cloud DNS)
```shell
# You can see the hierarchy by querying with +trace
dig devopsbeast.com A +trace

# This shows:
# .                IN NS a.root-servers.net.   <-- Root
# com.             IN NS a.gtld-servers.net.   <-- TLD
# devopsbeast.com. IN NS ns1.cloudflare.com.   <-- Authoritative
# devopsbeast.com. IN A  104.21.45.67          <-- Answer
```
Every fully qualified domain name (FQDN) technically ends with a dot: devopsbeast.com. — the trailing dot represents the root zone. When you type devopsbeast.com in a browser, the dot is implied. When debugging DNS, always use the trailing dot in dig queries to be explicit and avoid search domain issues.
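The search-list behavior is easy to see with a sketch. This hypothetical helper (`candidates_for` is an illustrative name, and the logic deliberately ignores ndots handling) shows which names a stub resolver may actually try for a given input:

```shell
# Sketch of search-list expansion (illustrative only; real resolvers also
# apply an ndots threshold before consulting the search list).
candidates_for() {
  local name=$1; shift          # remaining args: configured search domains
  case "$name" in
    *.) echo "${name%.}" ;;     # trailing dot: exactly this FQDN, no search
    *)  for d in "$@"; do echo "$name.$d"; done
        echo "$name" ;;         # the literal name is tried as well
  esac
}

# candidates_for api corp.example internal
#   -> api.corp.example, api.internal, api
# candidates_for devopsbeast.com. corp.example
#   -> devopsbeast.com   (search list never consulted)
```

The trailing-dot branch is the whole point: an explicit FQDN short-circuits search-domain expansion, which is why dig queries with the trailing dot are unambiguous.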
The Resolution Flow: Step by Step
When your application needs to resolve devopsbeast.com, here is the complete chain of events:
The Players
Stub resolver: The DNS client on your machine (or pod). It does not resolve anything itself — it just asks a recursive resolver and waits for the answer.
Recursive resolver: The workhorse. This server does the actual work of walking the DNS hierarchy. Common recursive resolvers: your ISP's resolver, Google Public DNS (8.8.8.8), Cloudflare (1.1.1.1), or CoreDNS in Kubernetes.
Root nameserver: Knows which nameservers are authoritative for each TLD. Does not know any actual domain IPs.
TLD nameserver: Knows which nameservers are authoritative for each second-level domain under that TLD. Does not know actual IPs either.
Authoritative nameserver: The final authority for your domain. This server has the actual DNS records (A, AAAA, CNAME, MX, etc.) and returns definitive answers.
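You can replay the recursive resolver's job by hand with non-recursive queries. The `ask` helper below is a hypothetical wrapper around dig (the nameserver names come from the +trace example earlier); each level returns only a referral until the authoritative server finally answers:

```shell
# Walk the delegation chain manually. +norecurse asks each server to answer
# only from its own data, so you see exactly what each level knows.
ask() {
  # $1 = nameserver to ask, $2 = name to resolve
  dig @"$1" "$2" A +norecurse +noall +answer +authority
}

# ask a.root-servers.net devopsbeast.com   # referral to the com. TLD servers
# ask a.gtld-servers.net devopsbeast.com   # referral to ns1.cloudflare.com
# ask ns1.cloudflare.com devopsbeast.com   # the actual A record
```

This is the same walk dig +trace performs for you, but doing it one hop at a time makes it obvious which level a broken delegation lives at.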
DNS Resolution: The Complete Journey
In the worst case (cold cache), this process requires four network round-trips: stub to recursive, recursive to root, recursive to TLD, recursive to authoritative. In practice, recursive resolvers cache root and TLD nameserver addresses aggressively, so most queries only require one or two round-trips.
```shell
# Measure the full resolution time (cold cache)
dig devopsbeast.com A +stats | grep "Query time"
# ;; Query time: 47 msec

# Query again immediately (warm cache)
dig devopsbeast.com A +stats | grep "Query time"
# ;; Query time: 1 msec   <-- Served from cache
```
Use dig +trace to see each step of the resolution chain. This is invaluable when debugging DNS because it shows you exactly where the chain breaks. If the root and TLD respond but the authoritative server does not, you know the problem is at your DNS provider. If the TLD returns wrong NS records, your domain registration is misconfigured.
TTL: Time to Live
Every DNS record has a TTL (Time to Live) — a number in seconds that tells resolvers how long to cache the answer. When the TTL expires, the resolver must query the authoritative server again.
```shell
# Check the TTL of a record
dig devopsbeast.com A

# ;; ANSWER SECTION:
# devopsbeast.com.  300  IN  A  104.21.45.67
#                   ^^^
#                   TTL = 300 seconds (5 minutes)
```
TTL is a trade-off:
| TTL | Caching | Propagation | DNS Load |
|---|---|---|---|
| 60s (1 min) | Minimal — frequent queries | Fast — changes visible in 1 min | High — resolver queries often |
| 300s (5 min) | Good — reduces DNS load | Moderate — 5 min to propagate | Moderate |
| 3600s (1 hour) | Excellent — very few queries | Slow — 1 hour to propagate | Low |
| 86400s (24 hours) | Maximum — almost never queries | Very slow — 24 hours to propagate | Minimal |
Before making a DNS change (migrating to a new IP, changing providers), always lower the TTL first. If your TTL is 86400 (24 hours) and you change the A record, some users will still see the old IP for up to 24 hours. Lower the TTL to 60 seconds a day before the change, wait for the old TTL to expire, then make the change. After the change propagates, raise the TTL back.
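The waiting step can be automated. A minimal sketch (`wait_for_ttl_drop` is a hypothetical helper name) that polls a public resolver until the cached TTL has counted down below your new value:

```shell
# Poll a resolver until the cached TTL for a name falls below a threshold,
# i.e., until the old long-TTL answer has aged out of its cache.
wait_for_ttl_drop() {
  local name=$1 threshold=${2:-60} resolver=${3:-1.1.1.1}
  while :; do
    local ttl
    ttl=$(dig @"$resolver" "$name" A +noall +answer | awk '{print $2; exit}')
    echo "cached TTL for $name: ${ttl:-no answer}"
    [ -n "$ttl" ] && [ "$ttl" -le "$threshold" ] && break
    sleep 30
  done
}

# Usage: wait_for_ttl_drop devopsbeast.com 60
```

Note this only observes one resolver's cache; other resolvers around the world expire the record on their own clocks.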
The TTL Countdown
TTL is not static. It counts down from the moment the resolver caches the record. If you query a record with TTL 300 and then query again 100 seconds later, the resolver returns it with TTL 200 (the remaining time).
```shell
# Query and note the TTL
dig devopsbeast.com A +noall +answer
# devopsbeast.com. 300 IN A 104.21.45.67

# Wait 60 seconds, query again
dig devopsbeast.com A +noall +answer
# devopsbeast.com. 240 IN A 104.21.45.67
#                  ^^^ TTL decreased by 60
```
Caching Layers
DNS answers are cached at multiple levels. Understanding these layers is critical when debugging "why is my DNS change not taking effect?"
DNS Caching Layers — From Closest to Farthest
Browser cache: Chrome, Firefox, and Safari all maintain their own DNS caches. Chrome: check at chrome://net-internals/#dns. TTL: usually respects the record TTL, capped at a browser-specific maximum. Clear: close and reopen the browser.
OS cache: macOS (mDNSResponder), Windows (DNS Client service), Linux (systemd-resolved or nscd). Each caches DNS responses. Clear: sudo dscacheutil -flushcache (macOS), ipconfig /flushdns (Windows), resolvectl flush-caches — or systemd-resolve --flush-caches on older systems (Linux).
Local network cache: home routers, corporate DNS proxies, and VPN DNS servers often cache responses. These are outside your direct control. Clear: restart the router or wait for TTL expiry.
Recursive resolver cache: your configured resolver (1.1.1.1, 8.8.8.8, ISP resolver) caches responses according to TTL. This is the cache that matters most for propagation. Clear: wait for TTL expiry (some public resolvers, including Cloudflare and Google, offer web-based cache-purge tools).
Authoritative server: the source of truth. When all caches expire, this is where the resolver gets the definitive answer. If the record is wrong here, everything downstream is wrong.
We once changed a CNAME record and waited the full TTL for propagation. The new record worked for everyone except one team, who kept hitting the old IP. After an hour of debugging, we discovered their corporate VPN was running a DNS proxy that cached responses for 24 hours regardless of TTL. The fix was to flush the VPN DNS proxy cache manually. Lesson: you do not control all the caching layers between your users and your authoritative server.
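For the caches you do control, a best-effort flush helper might look like this (`flush_local_dns` is a hypothetical name; the commands assume current macOS and a systemd-based Linux):

```shell
# Flush the local OS-level DNS cache, dispatching on the current platform.
flush_local_dns() {
  case "$(uname -s)" in
    Darwin)
      sudo dscacheutil -flushcache
      sudo killall -HUP mDNSResponder ;;
    Linux)
      # Newer systemd uses resolvectl; fall back to the older command
      resolvectl flush-caches 2>/dev/null || sudo systemd-resolve --flush-caches ;;
    *)
      echo "unsupported OS: $(uname -s)" ;;
  esac
}
```

This only clears layer two of five: the browser, network, and recursive-resolver caches are untouched.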
Negative Caching: When NXDOMAIN Gets Stuck
When a domain does not exist, the authoritative server responds with NXDOMAIN (Non-Existent Domain). This response is also cached, according to the SOA record's minimum TTL (also called the negative TTL).
```shell
# Query a domain that does not exist
dig nonexistent.devopsbeast.com A

# ;; AUTHORITY SECTION:
# devopsbeast.com. 1800 IN SOA ns1.cloudflare.com. dns.cloudflare.com. ...
#                  ^^^^
#                  Negative cache TTL = 1800 seconds (30 minutes)
```
This means if you create a new DNS record for api.devopsbeast.com, anyone who queried it before it existed (and got NXDOMAIN) will continue getting NXDOMAIN for up to 30 minutes, even though the record now exists.
Negative caching is one of the most frustrating DNS behaviors. You create a record, test it, and it works. But users report NXDOMAIN. The cause: they queried the domain before the record existed, and their resolver cached the NXDOMAIN response. The fix: wait for the negative TTL to expire. Prevention: create DNS records before pointing traffic to them, not after.
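Before creating a record that might already be negatively cached, check how long the penalty lasts. A small sketch (`negative_ttl` is a hypothetical helper) that reads the SOA minimum field, which per RFC 2308 bounds the negative-cache TTL together with the SOA record's own TTL:

```shell
# Print the SOA "minimum" field for a zone — the value resolvers use to
# decide how long to cache an NXDOMAIN answer for names in that zone.
negative_ttl() {
  # The minimum is the last field of the SOA RDATA
  dig "$1" SOA +noall +answer | awk '{print $NF; exit}'
}

# negative_ttl devopsbeast.com   # e.g. 1800
```

If this prints a large number, budget that much lead time between creating a record and expecting it to resolve for anyone who queried it early.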
DNS Transport: UDP, TCP, DoH, DoT
DNS has evolved beyond simple UDP queries:
UDP (port 53): The original and most common transport. Fast because no handshake is required. Limited to 512 bytes per response (or 4096 with EDNS0). If the response is too large, the server sets the TC (truncated) flag and the client retries over TCP.
TCP (port 53): Used for large responses (zone transfers, DNSSEC-signed responses with many records). Also used when UDP is blocked. Adds one RTT for the TCP handshake.
DNS over HTTPS (DoH, port 443): DNS queries wrapped in HTTPS. Provides privacy (your ISP cannot see your DNS queries). Used by browsers (Firefox, Chrome). Adds TLS overhead but leverages existing HTTPS infrastructure.
DNS over TLS (DoT, port 853): DNS queries wrapped in TLS. Similar privacy benefits to DoH but uses a dedicated port. Used by Android, systemd-resolved.
```shell
# Standard UDP query
dig @1.1.1.1 devopsbeast.com A

# Force TCP
dig @1.1.1.1 devopsbeast.com A +tcp

# DNS over HTTPS (using curl)
curl -s -H "accept: application/dns-json" \
  "https://1.1.1.1/dns-query?name=devopsbeast.com&type=A"
# {"Status":0,"Answer":[{"name":"devopsbeast.com","type":1,"TTL":300,"data":"104.21.45.67"}]}
```
In Kubernetes clusters, DNS between pods and CoreDNS is typically plain UDP on port 53 (with TCP fallback for truncated responses). If you need encrypted DNS for external resolution, configure CoreDNS to forward to an upstream resolver over TLS using the forward plugin with the tls:// prefix. This encrypts DNS queries leaving the cluster while keeping internal DNS fast.
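A minimal Corefile sketch of that setup (assumptions: standard CoreDNS forward-plugin syntax and Cloudflare's DoT endpoint; swap in your own upstream and tls_servername):

```
.:53 {
    errors
    kubernetes cluster.local in-addr.arpa ip6.arpa
    # Forward non-cluster queries upstream over DNS-over-TLS (port 853)
    forward . tls://1.1.1.1 tls://1.0.0.1 {
        tls_servername cloudflare-dns.com
    }
    cache 30
}
```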
DNS Failures and What They Mean
When DNS goes wrong, the error messages are specific and meaningful:
| Response | Meaning | Common Cause |
|---|---|---|
| NOERROR + answer | Success | Everything works |
| NOERROR + empty | Domain exists but no records of that type | Querying AAAA when only an A record exists |
| NXDOMAIN | Domain does not exist | Typo, deleted record, wrong zone |
| SERVFAIL | Resolver could not reach authoritative server | Network issue, authoritative server down, DNSSEC failure |
| REFUSED | Resolver rejected the query | Resolver does not serve that zone, access control |
| Timeout | No response at all | Firewall blocking UDP 53, resolver unreachable, network down |
```shell
# Check the response status
dig devopsbeast.com A +noall +comments
# ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12345
#                                        ^^^^^^^
#                                        This tells you what happened

# SERVFAIL example
dig @10.96.0.10 failing-domain.com A +noall +comments
# ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 67890
```
SERVFAIL is the hardest DNS error to debug because it is a generic failure. It can mean: (1) the authoritative server is unreachable from the resolver, (2) DNSSEC validation failed, (3) the authoritative server returned an invalid response, or (4) the resolver itself is broken. Always check with multiple resolvers (dig @8.8.8.8, dig @1.1.1.1) to isolate whether the problem is your resolver or the authoritative server.
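That multi-resolver check is easy to script. A sketch (`check_resolvers` is a hypothetical helper) that extracts the status code from each resolver's response header:

```shell
# Query the same name against several resolvers and report each status,
# to localize whether a failure is one resolver or the authoritative side.
check_resolvers() {
  local name=$1; shift
  for r in "$@"; do
    local status
    status=$(dig @"$r" "$name" A +noall +comments +time=2 +tries=1 \
      | awk -F'status: ' '/status:/ {print $2}' | cut -d',' -f1)
    echo "$r -> ${status:-TIMEOUT}"
  done
}

# check_resolvers failing-domain.com 8.8.8.8 1.1.1.1 10.96.0.10
```

If the public resolvers return NOERROR while your in-cluster resolver returns SERVFAIL, the problem is local; if all of them fail, suspect the authoritative server.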
Key Concepts Summary
- DNS is a hierarchical, distributed database — root servers delegate to TLD servers, which delegate to authoritative servers
- Recursive resolvers do the heavy lifting — they walk the hierarchy and cache results for future queries
- TTL controls caching duration — lower TTL means faster propagation but more DNS load
- Negative caching (NXDOMAIN) causes the most confusion — a cached "does not exist" response persists even after the record is created
- DNS uses UDP port 53 by default — TCP is used for large responses, DoH/DoT for privacy
- Caching happens at five layers: browser, OS, local network, recursive resolver, and authoritative server — you only control the last one
- NXDOMAIN means the domain does not exist, SERVFAIL means the resolver could not reach the authoritative server, and timeout means something is blocking DNS entirely
- Always lower TTL before making DNS changes — drop to 60s, wait for old TTL to expire, then make the change
Common Mistakes
- Making a DNS change without lowering the TTL first — users see stale records for hours
- Forgetting about negative caching — querying a domain before creating the record poisons caches with NXDOMAIN
- Testing DNS changes only from your machine — your cache may be warm while everyone else still has the old answer
- Assuming SERVFAIL means the domain does not exist — it means the resolver could not complete the query, which is a different problem entirely
- Not checking the authoritative server directly when debugging — always verify with dig @ns1.your-provider.com domain.com A to see the source of truth
- Confusing the TTL in the answer section (remaining cache time) with the TTL at the authoritative server (original TTL)
You create a new A record for api.devopsbeast.com. A colleague tests it and gets NXDOMAIN, even though dig against the authoritative server returns the correct IP. What is the most likely cause?