"Walk Me Through What Happens When You Create a Pod." It's Also How You Debug One.

The canonical senior Kubernetes interview question has a twelve-step answer, from kubectl apply to a Ready pod. The same twelve steps are the map you walk backward every time a pod is stuck. Learn the chain once and you get the interview answer and the debugging flow for free.

By Sharon Sahadevan · 10 min read

"Walk me through what happens when you create a pod."

It is the most common senior Kubernetes interview question, and the answer separates people cleanly. The junior answer is two steps: "kubectl apply, then the pod runs." The senior answer is twelve, and each step is a named component doing one specific thing with one specific way it can fail.

Here is the part most people miss: those twelve steps are not interview trivia. They are the exact map you walk, in reverse, every time a pod is stuck. kubectl describe pod tells you where in the chain your pod got stranded; knowing the chain tells you why. Learn it once, get the interview answer and the production debugging flow in the same trip.

So let's walk it, from kubectl apply to a pod that is Ready and serving traffic, and name the failure mode hiding at each step.

The twelve steps, in one breath

  1. kubectl converts your YAML to JSON and POSTs it to the apiserver over TLS.
  2. apiserver authenticates the caller.
  3. apiserver authorizes the verb (RBAC).
  4. mutating admission runs (defaulting, sidecar injection).
  5. schema validation, then validating admission (policy).
  6. apiserver writes the object to etcd (Raft commit).
  7. the Deployment controller creates a ReplicaSet.
  8. the ReplicaSet controller creates Pod objects, with no node assigned.
  9. the scheduler picks a node and binds the pod to it.
  10. the kubelet on that node sees the pod and starts syncing.
  11. the runtime builds the sandbox (CNI wires up networking), then pulls the image and starts the container.
  12. probes pass, the pod goes Ready, its IP lands in the Service's EndpointSlice, and kube-proxy routes traffic to it.

Now the detail, grouped into the four phases that matter.

Phase 1: the apiserver chain (steps 1 to 6)

Everything starts as an HTTPS request. kubectl is a thin REST client: it reads your YAML, opens a TLS connection to the apiserver, and POSTs the manifest as JSON. The credentials in your kubeconfig (a client certificate, a bearer token, or an exec plugin that fetches one) are what authenticate you.
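
To make "thin REST client" concrete, here is a minimal sketch of how a first-time create maps a manifest onto a REST path. The helper name collection_path and the Deployment named web are hypothetical; the path layout (/api for the core group, /apis for named groups) is the standard Kubernetes API convention.

```python
import json

def collection_path(api_version, plural, namespace):
    """Build the REST collection path that a create request is POSTed to."""
    # The core group ("v1") lives under /api; named groups ("apps/v1") under /apis.
    prefix = "/apis" if "/" in api_version else "/api"
    return f"{prefix}/{api_version}/namespaces/{namespace}/{plural}"

# Hypothetical manifest, already parsed from YAML into a dict.
manifest = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "web", "namespace": "default"},
}

path = collection_path(manifest["apiVersion"], "deployments", manifest["metadata"]["namespace"])
body = json.dumps(manifest)  # JSON goes on the wire, not YAML
print("POST", path)
```

The sketch only builds the request; sending it would additionally need the TLS connection and credentials from your kubeconfig.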

Inside the apiserver, the request runs a fixed gauntlet before it is allowed to persist anything:

  • Authentication. Who is this? The apiserver tries each configured authenticator: client cert (CN becomes the username), ServiceAccount bearer token (a signed JWT), OIDC, or a webhook. Output is a userInfo with username and groups. Failure here is a 401.
  • Authorization. Can this user do this verb on this resource in this namespace? That is RBAC. Failure is a 403, and the message names the user, verb, resource, and namespace, which is usually the whole diagnosis.
  • Mutating admission. Now the object gets rewritten. Built-in plugins default things (LimitRanger adds requests/limits), and webhooks like the Istio sidecar injector patch in containers. Each webhook returns a JSON patch. A webhook with failurePolicy: Fail that times out will reject your create outright, which is a classic "why is nothing deploying" incident.
  • Validation and validating admission. Schema validation against the OpenAPI spec (required fields, types, enums), then policy webhooks like Kyverno or OPA Gatekeeper, which can only allow or deny. A schema failure is usually a 422 Invalid that names the offending field (a body that cannot be decoded at all is a 400); a policy failure is a rejection that names the offending rule.
  • etcd write. Only now does the object become real. The apiserver serializes it and commits it to etcd through Raft, which means a majority of etcd members must fsync the write to disk before it is acknowledged. etcd assigns a resourceVersion, the apiserver updates its watch cache, and it broadcasts a watch event to every subscribed client.
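
The ordering of that gauntlet is the part worth internalizing, so here is a toy model of it. This is a sketch, not apiserver code: the Request shape, the GRANTS table, and the max-replicas policy are all invented for illustration. What it does show faithfully is that stages run in a fixed order, the first failure short-circuits, and nothing is persisted unless every stage passes.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Request:
    user: Optional[str]  # None models credentials that no authenticator accepted
    verb: str
    resource: str
    namespace: str
    obj: dict = field(default_factory=dict)

# Hypothetical RBAC grants, flattened to tuples for the sketch.
GRANTS = {("alice", "create", "deployments", "default")}

def authenticate(req):
    return None if req.user else "no authenticator accepted the credentials"

def authorize(req):
    key = (req.user, req.verb, req.resource, req.namespace)
    return None if key in GRANTS else f"{req.user} cannot {req.verb} {req.resource} in {req.namespace}"

def mutate(req):
    req.obj.setdefault("replicas", 1)  # stand-in for defaulting and sidecar injection
    return None

def validate_schema(req):
    return None if isinstance(req.obj.get("replicas"), int) else "replicas must be an integer"

def validate_policy(req):
    # Hypothetical rule of the kind a validating webhook might enforce.
    return None if req.obj["replicas"] <= 10 else "policy max-replicas: at most 10 replicas"

STAGES = [
    ("authentication", authenticate),
    ("authorization", authorize),
    ("mutating admission", mutate),
    ("schema validation", validate_schema),
    ("validating admission", validate_policy),
]

def handle(req, store):
    """Run the stages in order; persist only if every stage passes."""
    for name, stage in STAGES:
        error = stage(req)
        if error:
            return f"rejected at {name}: {error}"
    store.append(req.obj)  # stand-in for the etcd write
    return "persisted"
```

Calling handle(Request("alice", "create", "deployments", "default"), store) persists an object that already carries the defaulted replicas field, which is the point of mutation running before validation.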

KEY CONCEPT

The apiserver is the only component that talks to etcd. Everything else (controllers, scheduler, kubelet) learns about changes by watching the apiserver, not by reading the database. That single fact explains two things. It is why the system scales: one etcd write fans out to thousands of watchers through the apiserver's cache. And it is why an etcd problem (quorum loss, or mvcc: database space exceeded from a missed compaction) freezes writes for the entire cluster at once.
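
A minimal sketch of that fan-out, assuming a single in-memory writer. ToyApiServer and its callbacks are hypothetical names, and the integer counter only stands in for the etcd revision that backs resourceVersion; real watches are long-lived HTTP streams, not function calls.

```python
class ToyApiServer:
    """Single writer with a revision counter and push-style watchers."""

    def __init__(self):
        self.revision = 0   # stand-in for the etcd revision behind resourceVersion
        self.objects = {}
        self.watchers = []

    def watch(self, callback):
        self.watchers.append(callback)

    def write(self, key, obj):
        event = "MODIFIED" if key in self.objects else "ADDED"
        self.revision += 1
        stored = {**obj, "resourceVersion": str(self.revision)}
        self.objects[key] = stored
        for notify in self.watchers:  # one write, many notifications
            notify(event, key, stored)
        return stored

api = ToyApiServer()
seen_by_scheduler, seen_by_kubelet = [], []
api.watch(lambda ev, key, obj: seen_by_scheduler.append((ev, key, obj["resourceVersion"])))
api.watch(lambda ev, key, obj: seen_by_kubelet.append((ev, key, obj["resourceVersion"])))

api.write("pods/default/web-0", {"nodeName": None})
api.write("pods/default/web-0", {"nodeName": "node-a"})
```

Both watchers observe the same two events in the same order without either of them touching the store directly.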

At the end of phase 1, a Deployment object exists in etcd. Notice what does not exist yet: no ReplicaSet, no pods, nothing running. The apiserver's job was only to validate and persist your intent.

Phase 2: the controller cascade (steps 7 to 8)

Kubernetes turns intent into reality through controllers, each watching the apiserver and reconciling desired state toward actual state.

  • The Deployment controller sees the new Deployment, hashes the pod template, and creates a ReplicaSet for that hash. It never creates pods directly; it only manages ReplicaSets. (This indirection is what makes rolling updates work: it scales a new ReplicaSet up while scaling the old one down.)
  • The ReplicaSet controller sees the new ReplicaSet, counts pods matching its selector, finds zero, and creates the desired number of Pod objects.

Those pods are created with an empty spec.nodeName. They are Pending. Nothing has decided where they will run. If the pods never appear at all, the failure lives here: a ResourceQuota or Pod Security admission rejection can stop the ReplicaSet controller from creating them, and because no pod exists to describe, the evidence is a FailedCreate event on the ReplicaSet (kubectl describe replicaset).
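
The cascade is easier to hold in your head as two small reconcile functions. This is a sketch under loud assumptions: objects are plain dicts, SHA-256 stands in for the controller's own template hash, and pod names use a counter where real pods get a random suffix. The pod-template-hash label is the real label the Deployment controller adds.

```python
import hashlib
import json

def template_hash(template):
    # Stand-in hash: the real controller uses its own function, not SHA-256.
    blob = json.dumps(template, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:10]

def reconcile_deployment(deployment, replicasets):
    """Ensure a ReplicaSet exists for the current pod template; never touch pods."""
    digest = template_hash(deployment["template"])
    name = f"{deployment['name']}-{digest}"
    if name not in replicasets:
        selector = {**deployment["template"]["labels"], "pod-template-hash": digest}
        replicasets[name] = {"replicas": deployment["replicas"], "selector": selector}
    return name

def reconcile_replicaset(name, replicaset, pods):
    """Create however many pods are missing; return the number created."""
    selector = replicaset["selector"]
    matching = [p for p in pods if selector.items() <= p["labels"].items()]
    missing = max(replicaset["replicas"] - len(matching), 0)
    for _ in range(missing):
        pods.append({
            "name": f"{name}-{len(pods)}",  # real pods get a random suffix
            "labels": dict(selector),
            "nodeName": None,               # unscheduled: the scheduler's cue
        })
    return missing

# Hypothetical Deployment, reduced to the fields the sketch needs.
deployment = {"name": "web", "replicas": 3, "template": {"labels": {"app": "web"}, "image": "nginx"}}
replicasets, pods = {}, []
rs_name = reconcile_deployment(deployment, replicasets)
created_first = reconcile_replicaset(rs_name, replicasets[rs_name], pods)
created_again = reconcile_replicaset(rs_name, replicasets[rs_name], pods)
```

The second call creates nothing because desired and actual already match, which is what makes a reconcile loop safe to run repeatedly.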

Phase 3: the scheduler (step 9)

The scheduler watches for exactly one thing: pods with no spec.nodeName. For each one it runs a scheduling cycle:

  • Filter. For every node, can this pod fit here? Resource requests, taints and tolerations, node affinity, topology constraints. Nodes that fail any filter are eliminated.
  • Score. For the survivors, how good is each one? Plugins score nodes 0 to 100 and the scores aggregate.
  • Bind. The highest-scoring node wins, and the scheduler POSTs a Binding for the pod to the apiserver, which sets pod.spec.nodeName.

That binding is the whole job. The scheduler does not contact the node or start anything. It gets one field written. If no node survives filtering, the pod stays Pending and kubectl describe pod says something like 0/50 nodes are available: 12 Insufficient memory, 38 node(s) had untolerated taint. That sentence is the scheduler telling you exactly which filters killed which nodes, which is the fastest unstick-a-pending-pod signal in the system.
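
Filter, score, and bind fit in a few lines once you strip them down. The sketch below assumes two made-up filters and one made-up scoring rule over plain dicts; the real scheduler runs a plugin framework with many more of each, and binding goes through the apiserver rather than mutating a dict.

```python
from collections import Counter

def fits_memory(pod, node):
    return None if node["free_mem"] >= pod["mem_request"] else "Insufficient memory"

def tolerates_taints(pod, node):
    untolerated = set(node["taints"]) - set(pod["tolerations"])
    return "node(s) had untolerated taint" if untolerated else None

FILTERS = [fits_memory, tolerates_taints]

def score(pod, node):
    # Hypothetical rule: prefer the node with the most memory left over, scaled 0..100.
    return round(100 * (node["free_mem"] - pod["mem_request"]) / node["total_mem"])

def schedule(pod, nodes):
    """Return (node_name, message); node_name is None when nothing fits."""
    reasons, feasible = Counter(), []
    for node in nodes:
        failure = None
        for check in FILTERS:
            failure = check(pod, node)
            if failure:
                break  # first failing filter eliminates the node
        if failure:
            reasons[failure] += 1
        else:
            feasible.append(node)
    if not feasible:
        detail = ", ".join(f"{count} {reason}" for reason, count in reasons.items())
        return None, f"0/{len(nodes)} nodes are available: {detail}"
    best = max(feasible, key=lambda n: score(pod, n))
    pod["nodeName"] = best["name"]  # the only thing "binding" changes in this model
    return best["name"], "bound"

nodes = [
    {"name": "a", "free_mem": 4, "total_mem": 8, "taints": []},
    {"name": "b", "free_mem": 6, "total_mem": 8, "taints": []},
    {"name": "c", "free_mem": 8, "total_mem": 8, "taints": ["gpu"]},
]
small = {"mem_request": 2, "tolerations": [], "nodeName": None}
large = {"mem_request": 7, "tolerations": [], "nodeName": None}
placed = schedule(small, nodes)
stuck = schedule(large, nodes)
```

The message for the pod that cannot be placed is built from the per-filter counts, which is why the real one reads like a tally.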

Phase 4: the kubelet and the node (steps 10 to 12)

Now the node takes over. The kubelet on the chosen node is watching for pods whose nodeName matches itself. It sees the freshly bound pod and its sync loop compares desired state (this pod spec) against actual state (containers running here), then acts through the CRI (Container Runtime Interface).

The sandbox and networking come first, before any of your containers. The kubelet calls CRI RunPodSandbox. The runtime (containerd) creates a network namespace and invokes the CNI plugin (Calico, Cilium, and so on) via /opt/cni/bin. The CNI plugin allocates an IP, creates a veth pair, and configures routes. Then the pause container starts and holds that namespace open for the life of the pod. If this step fails you see the pod wedged in ContainerCreating with failed to setup network for sandbox, and the cause is your CNI: IP exhaustion, an agent that is down, a misconfigured subnet. The pod has not even tried to pull your image yet.
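
IP exhaustion is the easiest of those CNI failures to picture. Here is a toy address allocator, assuming a hypothetical per-node pod range small enough to run dry; ToyIpam is an invented name, and real IPAM plugins reserve addresses and persist leases in ways this sketch ignores. Only the leading phrase of the error string comes from the message quoted above; the rest is made up.

```python
import ipaddress

class ToyIpam:
    """Hands out addresses from one CIDR until none are left."""

    def __init__(self, cidr):
        self.free = list(ipaddress.ip_network(cidr).hosts())
        self.leases = {}

    def add(self, pod_name):
        if not self.free:
            # Illustrative wording only, not a real runtime log line.
            raise RuntimeError(f"failed to setup network for sandbox: no free address for {pod_name}")
        ip = self.free.pop(0)
        self.leases[pod_name] = ip
        return str(ip)

    def delete(self, pod_name):
        ip = self.leases.pop(pod_name, None)
        if ip is not None:
            self.free.append(ip)  # released addresses become allocatable again

ipam = ToyIpam("10.244.1.0/30")  # hypothetical range with two usable addresses
first = ipam.add("web-0")
second = ipam.add("web-1")
```

A third add fails until one of the first two pods is deleted, which is the shape of the incident: new pods wedge in ContainerCreating while existing ones keep running.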

Then the containers. For each container the kubelet issues CRI PullImage, CreateContainer (overlay mount, cgroups, namespaces, capabilities), and StartContainer, at which point runc execs your entrypoint. Init containers run sequentially, app containers in parallel after them. This is where the failure modes everyone recognizes live: ImagePullBackOff (typo, missing credentials, registry down), CrashLoopBackOff (your process keeps exiting), OOMKilled (it exceeded its memory limit and the kernel killed it).
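
The call order is worth seeing as a trace. This sketch records the CRI calls named above for a hypothetical pod and stops at the first image that cannot be pulled. It is a simplification: the real kubelet also waits for each init container to exit successfully before moving on, and retries pulls with backoff, neither of which is modeled here.

```python
CRI_STEPS = ("PullImage", "CreateContainer", "StartContainer")

def start_pod(pod, pullable_images):
    """Return (calls, status) for a simplified, single-pass pod start."""
    calls = ["RunPodSandbox"]  # sandbox and networking come first
    # Init containers first, in order, then the app containers.
    for container in pod.get("initContainers", []) + pod["containers"]:
        for step in CRI_STEPS:
            calls.append(f"{step}({container['name']})")
            if step == "PullImage" and container["image"] not in pullable_images:
                return calls, "ImagePullBackOff"
    return calls, "Running"

# Hypothetical pod with one init container and one app container.
pod = {
    "initContainers": [{"name": "migrate", "image": "tool:1"}],
    "containers": [{"name": "web", "image": "web:1"}],
}
ok_calls, ok_status = start_pod(pod, {"tool:1", "web:1"})
bad_calls, bad_status = start_pod(pod, {"tool:1"})
```

In the failing run the trace ends at PullImage(web): the init container completed, the sandbox exists, and the app container was never created.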

Finally, readiness and routing. The kubelet runs the probes: a startup probe gates the others for slow starters, the readiness probe decides whether the pod is Ready, and the liveness probe restarts the container if it wedges. When readiness passes, the pod's Ready condition flips to true. The EndpointSlice controller notices and marks the pod's IP as a ready endpoint in the matching Service's EndpointSlice, and kube-proxy on every node updates its iptables (or IPVS, or nftables) rules to include it; clusters running an eBPF data plane such as Cilium can do the same job without kube-proxy. Only now does Service traffic reach your pod.
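
The gating logic reduces to two conditions: the pod's labels match the Service selector, and every container is ready. A minimal sketch, assuming a simplified pod dict with hypothetical startup_passed and readiness_passed flags in place of real probe results:

```python
def container_ready(container):
    """Readiness only counts once the startup probe (if any) has passed."""
    if not container.get("startup_passed", True):
        return False
    return container["readiness_passed"]

def pod_ready(pod):
    return all(container_ready(c) for c in pod["containers"])

def ready_backends(selector, pods):
    """IPs that Service traffic may be sent to: matching labels and Ready."""
    return sorted(
        pod["ip"] for pod in pods
        if selector.items() <= pod["labels"].items() and pod_ready(pod)
    )

pods = [
    {"ip": "10.0.0.1", "labels": {"app": "web"}, "containers": [{"readiness_passed": True}]},
    {"ip": "10.0.0.2", "labels": {"app": "web"},
     "containers": [{"startup_passed": False, "readiness_passed": True}]},
    {"ip": "10.0.0.3", "labels": {"app": "db"}, "containers": [{"readiness_passed": True}]},
]
backends = ready_backends({"app": "web"}, pods)
```

Only the first pod is routable: the second matches the selector but is still behind its startup probe, and the third is Ready but belongs to a different Service.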

End to end, this is typically 30 to 90 seconds for an ordinary pod, longer for a fat image or a slow startup probe.

The reason this doubles as a debugging map

Read the chain backward and it becomes a triage flow. A pod is stuck; where?

  • Stuck Pending with a scheduler message → phase 3, no node fit the filters.
  • Pods missing entirely, so there is nothing to describe → phase 2, a quota or policy blocked pod creation; look for FailedCreate events on the ReplicaSet.
  • Stuck ContainerCreating → phase 4 sandbox, most often CNI or a volume mount; a slow image pull also shows up here.
  • ImagePullBackOff / CrashLoopBackOff / OOMKilled → phase 4 containers.
  • Running but no traffic → phase 4 tail, readiness failing so the pod never becomes a ready endpoint in the EndpointSlice.

PRO TIP

kubectl describe pod is not a wall of text, it is a position report. The Events section tells you the last step the pod completed, and the chain above tells you what comes next and what tends to break there. Debugging a stuck pod is just: find the step it stalled on, then check that one component.
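
The triage flow can be written down as a lookup. This is a sketch over an invented pod summary (the keys phase, reason, scheduler_message, and ready are hypothetical, not the real Pod API), and it deliberately falls through to "read the Events" when no rule matches rather than guessing.

```python
CONTAINER_FAILURES = {"ImagePullBackOff", "CrashLoopBackOff", "OOMKilled"}

def triage(pod):
    """Map a simplified pod summary to the phase worth inspecting first."""
    if pod is None:
        # No pod object exists, so creation itself was blocked.
        return "phase 2: pod never created, check ReplicaSet events"
    reason = pod.get("reason")
    if reason in CONTAINER_FAILURES:
        return "phase 4: containers"
    if reason == "ContainerCreating":
        return "phase 4: sandbox (CNI or volume mount)"
    if pod["phase"] == "Pending" and pod.get("scheduler_message"):
        return "phase 3: no node passed the filters"
    if pod["phase"] == "Running" and not pod.get("ready", False):
        return "phase 4: readiness failing"
    return "no rule matched: read the Events section"
```

For example, triage({"phase": "Pending", "scheduler_message": "0/50 nodes are available"}) points at phase 3, while triage(None) points at phase 2.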

The follow-ups that separate senior from staff

Interviewers rarely stop at the walkthrough. They probe the edges, and the edges are where the architecture earns its keep:

  • "The scheduler binds the pod, then the node dies before the kubelet starts it." The kubelet on the dead node never sees the pod. The node controller marks the node NotReady, and after the eviction timeout (by default, pods tolerate the not-ready and unreachable taints for 300 seconds) the pod is removed. The ReplicaSet controller sees the gap and creates a replacement, which the scheduler places elsewhere with a new IP, and the EndpointSlice follows. The system is self-healing because every layer is a reconciliation loop, not a one-shot script.
  • "The image is huge and the pull takes five minutes." The pod sits in ContainerCreating the whole time, Ready is false, no traffic routes to it. The fix is upstream of Kubernetes: smaller images, a warm image cache, or pre-pulling.
  • "How is a StatefulSet different?" Stable pod names and stable PVCs, the StatefulSet controller replacing the Deployment-plus-ReplicaSet pair, and ordered sequential creation by default, with volume binding added to the timeline.

Each of these is the same twelve-step model, poked at a specific joint. Once you hold the structure, the follow-ups answer themselves.
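
The first follow-up, the node that dies after binding, can be sketched as three independent loops each doing its one job. Everything here is a toy: pods are dicts, the function names are invented, and round-robin placement stands in for real scheduling.

```python
def evict(pods, dead_node):
    """Node lifecycle: drop pods bound to the dead node."""
    return [p for p in pods if p["nodeName"] != dead_node]

def reconcile(pods, desired):
    """ReplicaSet controller: create unscheduled replacements for the gap."""
    pods = list(pods)
    while len(pods) < desired:
        pods.append({"name": f"web-new-{len(pods)}", "nodeName": None})
    return pods

def bind(pods, healthy_nodes):
    """Scheduler: place anything without a node (round-robin as a stand-in)."""
    for index, pod in enumerate(pods):
        if pod["nodeName"] is None:
            pod["nodeName"] = healthy_nodes[index % len(healthy_nodes)]
    return pods

pods = [
    {"name": "web-0", "nodeName": "node-a"},
    {"name": "web-1", "nodeName": "node-b"},
]
after = bind(reconcile(evict(pods, "node-b"), desired=2), ["node-a", "node-c"])
```

No function knows about the others. Each one looks at current state, fixes its own slice of it, and the replica count recovers as a side effect.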

The mental model

Pod creation is not magic between kubectl apply and a running container. It is a pipeline of independent components, each watching the apiserver, each reconciling one thing, each with a characteristic failure. Name the twelve steps and you can answer the interview question cold, and more usefully, you can look at any stuck pod and know exactly which of the twelve places to look.

Memorize the structure. The details fall in around it.


This walkthrough is the flagship lesson of the Kubernetes Architecture & Chaos course, which goes component by component (apiserver request lifecycle, etcd Raft, the scheduler framework, the kubelet and CRI, networking internals) and then turns that architecture into chaos-engineering reasoning: how to test resilience without breaking production. The interview framing continues in Kubernetes System Design Interview Prep. Related reading for the failure modes named above: How to Debug Kubernetes OOMKilled and Kubernetes Probes Done Wrong for phase 4, kube-proxy iptables vs IPVS for how step 12 actually routes traffic, etcd Compaction and Defrag for the database space exceeded failure in step 6, and Your Kubernetes Cluster Died at 2am from an Expired Cert for when step 1's TLS handshake is the thing that breaks.