
Why Every Kubernetes Cluster Makes 5 DNS Queries For One Lookup

ndots:5 is the silent latency killer in Kubernetes. Every external hostname resolution generates four wasted queries before the right one. Here is why, and how to fix it.

By Sharon Sahadevan · 9 min read

A pod resolves api.stripe.com. That is one hostname. How many DNS queries does the pod actually send?

Run this in a busy production pod and watch:

kubectl exec -it $POD -- /bin/sh
$ apt-get update && apt-get install -y dnsutils
$ dig +search +noall +answer +stats api.stripe.com 2>&1 | tail

(Note the +search flag: dig skips the resolver's search list by default, so without it you would only see the final bare-name query.)

Or capture with tcpdump on the node:

api.stripe.com.payments.svc.cluster.local. → NXDOMAIN
api.stripe.com.svc.cluster.local.          → NXDOMAIN
api.stripe.com.cluster.local.              → NXDOMAIN
api.stripe.com.us-east-1.compute.internal. → NXDOMAIN
api.stripe.com.                            → A 54.187.205.xxx

Five queries. Four of them returned NXDOMAIN before the fifth one finally hit the actual record. Every external hostname your pods resolve does this. Every single one.

This is the famous ndots:5 behavior, and it is one of the most consequential network defaults in Kubernetes. Understanding it is the difference between a cluster that handles 50,000 DNS queries per second and a cluster that runs out of CoreDNS capacity at 10,000.

Why this happens: the search list#

Look at the /etc/resolv.conf inside any pod:

kubectl exec -it $POD -- cat /etc/resolv.conf

You will see something like:

nameserver 10.96.0.10
search payments.svc.cluster.local svc.cluster.local cluster.local us-east-1.compute.internal
options ndots:5

Three things matter here:

nameserver 10.96.0.10: the cluster IP of the kube-dns Service (CoreDNS, in modern clusters). Every DNS query the pod makes goes here first.

search ...: a list of suffixes the resolver appends to "short" hostnames before querying. For a pod in the payments namespace, the search list is:

  1. payments.svc.cluster.local (own namespace)
  2. svc.cluster.local (any namespace)
  3. cluster.local (cluster-wide names)
  4. us-east-1.compute.internal (cloud-provided, typically the node's domain)

options ndots:5: the rule for when to apply the search list. The number is the threshold: any name with fewer dots than this gets the search list applied first; any name with this many dots or more gets queried as-is first.
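
A few concrete dot counts make the rule tangible (the names here are illustrative):

redis                                    0 dots → search list first
billing.payments                         1 dot  → search list first
api.stripe.com                           2 dots → search list first
grafana.monitoring.svc.cluster.local     4 dots → search list first
metrics.eu.api.vendor.example.com        5 dots → queried as-is first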

Putting these together: when you resolve api.stripe.com (2 dots), the resolver thinks "fewer than 5 dots, so try the search list first." It tries each suffix in order, gets NXDOMAIN on each, and finally tries the bare name.

That is the four wasted queries.

Why the default is 5#

The ndots:5 default was chosen so that every internal Kubernetes name form, up to the fully qualified service.namespace.svc.cluster.local (four dots), stays below the threshold and gets search expansion. In other words, when you write service.namespace in your code, the resolver knows to try service.namespace.svc.cluster.local before falling back to a bare lookup.

For internal-facing workloads, this is fine. For workloads that mostly call external services (payment APIs, third-party SaaS, cloud APIs), this is terrible. Every external call wastes four queries.
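
The upside is visible from inside any pod: a short internal name resolves through search expansion without the caller ever writing the full suffix. A sketch, assuming a hypothetical billing Service in the payments namespace (getent uses the system resolver, so ndots and the search list apply, unlike plain dig):

$ getent hosts billing.payments
10.96.34.12     billing.payments.svc.cluster.local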

The cost at scale#

Multiply this out:

  • A pod that calls 10 external APIs per request, at 100 requests per second, generates 1,000 external hostname resolutions per second.
  • With ndots:5, that is 5,000 actual DNS queries per second from one pod.
  • Across 100 such pods, that is 500,000 queries per second hitting CoreDNS.

CoreDNS is fast (tens of thousands of queries per second per replica), but it is not infinite. At some scale, you are paying real CPU and pod-replica budget for queries that mostly return NXDOMAIN.

The other cost is latency. Each NXDOMAIN takes a round trip to CoreDNS. Even if CoreDNS responds in 1ms, the four failed lookups add 4ms before the real resolution even starts. For a workload that handles thousands of external requests per second, this overhead is real.

How to see your own DNS amplification#

Three observations to make:

1. CoreDNS query rate by response code. The ratio of NXDOMAIN to NOERROR tells you how much of your DNS traffic is wasted lookups.

sum(rate(coredns_dns_responses_total{rcode="NXDOMAIN"}[5m]))
/
sum(rate(coredns_dns_responses_total[5m]))

If this is above 0.5 (more than half of responses are NXDOMAIN), you have heavy ndots amplification.

2. Per-pod tcpdump. The unambiguous truth: just look at what queries are being sent.

# On the node hosting the pod, capture DNS traffic
# (-i any sees every pod; tcpdump on the pod's veth interface isolates one)
kubectl debug -it node/$NODE --image=nicolaka/netshoot
$ tcpdump -i any -n 'port 53' | head -50

You will see the search-suffix queries and the final bare-name query for each external resolution.

3. CoreDNS slow query logs. Enable the log plugin in CoreDNS and look at queries by name pattern. NXDOMAIN responses for *.svc.cluster.local patterns from external hostnames are the smoking gun.
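
If full query logging is too noisy, the log plugin can be scoped to denial responses only. A minimal Corefile sketch (your real Corefile keeps its other plugins):

.:53 {
    log . {
        class denial
    }
    # ... kubernetes, cache, forward as before
}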

The three fixes#

There are three options, in increasing order of cluster-wide impact.

Fix 1: FQDN with trailing dot (per call)#

The simplest: when you know a hostname is external, append a trailing dot to make it a fully qualified domain name. The resolver sees the trailing dot and skips the search list entirely.

# Python: requests with FQDN
import requests
r = requests.get("https://api.stripe.com./v1/charges")  # note the dot

// Go: explicit FQDN (net/http)
resp, err := http.Get("https://api.stripe.com./v1/charges")

This works but is fragile: developers forget; libraries normalize away the trailing dot; it does not help with hostnames coming from configuration. Useful as a tactical patch, not a strategic fix.

Fix 2: Per-pod ndots override#

Set dnsConfig.options in the pod spec to override ndots:

apiVersion: v1
kind: Pod
metadata:
  name: external-api-client   # placeholder name
spec:
  containers:
    - name: app
      image: my-app:latest    # your workload container
  dnsConfig:
    options:
      - name: ndots
        value: "1"

ndots:1 means "if the name has at least 1 dot, query it as-is first; only fall back to the search list if that fails." For api.stripe.com (2 dots), the resolver queries api.stripe.com directly first. One query, one answer.

For internal lookups, this still works: myservice.mynamespace (1 dot) is queried as-is first, fails with NXDOMAIN, then falls back to the search list and resolves correctly. You add one wasted query for internal names but save four wasted queries for external names. For workloads heavy on external calls, the trade is dramatically positive.

The downside: every workload needs this config set. With 100 deployments, that is 100 places to change. A mutating webhook (Kyverno, OPA Gatekeeper) can apply it cluster-wide, but that is its own engineering work.
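
For the webhook route, a minimal Kyverno sketch (assuming Kyverno is installed; the policy name is illustrative, and in practice you would scope the match to namespaces that are external-heavy):

# Mutation policy sketch: inject ndots:1 into every new Pod at admission
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: set-ndots-1
spec:
  rules:
    - name: add-dns-config
      match:
        any:
          - resources:
              kinds:
                - Pod
      mutate:
        patchStrategicMerge:
          spec:
            dnsConfig:
              options:
                - name: ndots
                  value: "1"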

Fix 3: NodeLocal DNSCache#

NodeLocal DNSCache runs a DNS cache as a DaemonSet on every node. Pod queries go to a local cache first, which forwards to CoreDNS only on a miss. Most importantly, repeated queries for the same name, including the NXDOMAINs from the suffix walk, are answered from local memory.

# Simplified install (full YAML in the K8s docs)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-local-dns
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: node-local-dns
  template:
    metadata:
      labels:
        k8s-app: node-local-dns
    spec:
      hostNetwork: true
      containers:
        - name: node-cache
          image: registry.k8s.io/dns/k8s-dns-node-cache:1.23.1
          args:
            - -localip
            - 169.254.20.10
            - -conf
            - /etc/Corefile
          # ... volume mounts and config

Pods configured to use 169.254.20.10 as their primary DNS resolver (typically via kubelet, sketched after the list below) hit the local cache; cache misses go to CoreDNS. The win:

  • Repeated queries are nearly free (in-memory cache hit on the same node).
  • Even with ndots:5, the four wasted queries are answered from cache after the first time.
  • CoreDNS load drops dramatically.
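
Pointing pods at 169.254.20.10 is usually a kubelet-level setting rather than per-pod config. A sketch of the relevant KubeletConfiguration field (some installs instead intercept the kube-dns service IP transparently via the cache's interface and NOTRACK rules, in which case no kubelet change is needed):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
clusterDNS:
  - 169.254.20.10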

This is the production answer for clusters at scale. Combined with ndots:1 per pod, it is even better.

Cilium and Cilium Local Redirect Policy (modern alternative)#

If your CNI is Cilium, there is a fourth option: LocalRedirectPolicy redirects pod DNS traffic to a local node-level DNS resolver via eBPF, without changing the pod's resolv.conf. The local resolver can have its own caching and ndots policy. Same idea as NodeLocal DNSCache, less configuration.
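
A sketch of the policy shape, assuming a node-local DNS cache already runs on each node with the label k8s-app: node-local-dns (this mirrors the node-local-dns example in the Cilium docs):

apiVersion: cilium.io/v2
kind: CiliumLocalRedirectPolicy
metadata:
  name: nodelocaldns
  namespace: kube-system
spec:
  redirectFrontend:
    serviceMatcher:
      serviceName: kube-dns
      namespace: kube-system
  redirectBackend:
    localEndpointSelector:
      matchLabels:
        k8s-app: node-local-dns
    toPorts:
      - port: "53"
        name: dns
        protocol: UDP
      - port: "53"
        name: dns-tcp
        protocol: TCP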

What about CoreDNS itself?#

CoreDNS is not the cause of ndots amplification, but it is the load-bearing component. A few CoreDNS-side levers worth knowing:

Increase replicas based on load. The default two replicas are fine for small clusters. For high DNS traffic, scale to five or more and drive a HorizontalPodAutoscaler from CPU or from the rate of coredns_dns_requests_total.
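
A sketch of the CPU-driven autoscaler (thresholds and replica counts are illustrative; some clusters use cluster-proportional-autoscaler keyed to node count instead):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: coredns
  namespace: kube-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: coredns
  minReplicas: 5
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70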

Tune the cache plugin. CoreDNS has a built-in cache plugin; the stock Kubernetes Corefile sets cache 30, capping record TTLs at 30 seconds. Raising that cap for stable external names (say, 5 minutes for *.amazonaws.com) reduces upstream queries. Be careful: long TTLs can mask real DNS changes.
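
Zone-scoped server blocks are one way to raise TTLs selectively; a sketch (the 300-second figure is illustrative):

# Longer cache for AWS hostnames, default elsewhere
amazonaws.com:53 {
    cache 300
    forward . /etc/resolv.conf
}
.:53 {
    cache 30
    # ... kubernetes plugin, forward, etc.
}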

Enable the autopath plugin. The autopath plugin in CoreDNS short-circuits the search-suffix walk on the server side: when it sees the first search-expanded query (api.stripe.com.payments.svc.cluster.local), it recognizes the cluster search path, resolves the bare name itself, and returns a CNAME to the real record, so the client never sends the remaining queries. This trades some CoreDNS state for fewer round trips.

# In CoreDNS Corefile
.:53 {
    autopath @kubernetes
    kubernetes cluster.local in-addr.arpa ip6.arpa {
      pods verified
    }
    cache 30
    forward . /etc/resolv.conf
}

Note: autopath requires pods verified mode, which has some overhead.

A common false alarm: high CoreDNS query rate#

A team sees CoreDNS at 10K QPS and thinks they have a problem. They might. They might also be looking at normal traffic from a healthy ndots:5 cluster making lots of external calls. The right question is not "is my CoreDNS query rate high" but "what is the NXDOMAIN ratio and what is my p99 DNS resolution latency." Those tell you whether the rate is hurting or just noisy.
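
CoreDNS exposes a request-duration histogram, so the p99 is one PromQL query away:

histogram_quantile(0.99,
  sum(rate(coredns_dns_request_duration_seconds_bucket[5m])) by (le))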

Quick reference: the DNS amplification checklist#

1. Measure NXDOMAIN ratio:
   sum(rate(coredns_dns_responses_total{rcode="NXDOMAIN"}[5m]))
   / sum(rate(coredns_dns_responses_total[5m]))
   (above 0.5 = ndots amplification)

2. Capture queries from a representative pod:
   tcpdump -i any -n 'port 53' on the node
   (look for the .svc.cluster.local NXDOMAINs)

3. Pick the right fix:
   - One workload, mostly external: dnsConfig ndots:1 on that pod
   - Many workloads: mutating webhook to apply ndots:1 cluster-wide
   - High DNS traffic at scale: NodeLocal DNSCache (or Cilium LocalRedirect)
   - Mostly internal traffic: leave ndots:5; tune CoreDNS cache and replicas

4. Re-measure NXDOMAIN ratio after change.

5. Set up an SLO on DNS latency p99 (target: under 5ms intracluster).

The mental model#

ndots:5 exists because Kubernetes service names are deeply qualified. It costs you four wasted round trips per external name resolution. At small scale, the cost is invisible. At large scale, it is one of the largest sources of avoidable latency and CoreDNS load in your cluster.

The default is reasonable for clusters that are mostly internal. For clusters that talk a lot to external APIs, change the default. Either per-pod (dnsConfig), cluster-wide (mutating webhook), or by adding NodeLocal DNSCache. All three are well-trodden paths.

Once you know to look for it, you will see the amplification on every cluster you inherit. Now you know how to fix it.


This kind of "default that bites at scale" is the bread and butter of the Kubernetes Performance Optimization course, where we cover the network, storage, and CPU defaults that quietly cost you at production volume. And the Networking Fundamentals course covers DNS, TCP, and Kubernetes-specific networking from first principles.