Kubernetes Architecture & Chaos

The Control Plane and Data Plane Split

The single most useful sentence in Kubernetes is: the control plane decides what should run; the data plane runs it. Almost every architectural question, every debugging session, and every interview answer starts here. If you cannot place a component on one side or the other of that line, you cannot reason about what happens when it fails.

This lesson is the mental model. The whole rest of the course depends on it being clean in your head — apiserver, etcd, scheduler, controller-manager on one side; kubelet, kube-proxy, CNI dataplane, and the container runtime on the other. The split is not an implementation detail. It is the design.

KEY CONCEPT

Control plane components decide. Data plane components do. When the control plane is down, existing pods keep running because the data plane has no dependency on the control plane being available — only on the state already pushed to it. This is the property that makes Kubernetes survivable.

The two halves at a glance

Control plane (decides)

  • kube-apiserver: REST API, validation, watch; only writer to etcd
  • etcd: consistent KV store, source of truth
  • kube-scheduler: picks a node for each pending pod
  • kube-controller-manager: ReplicaSet, Deployment, Endpoint, GC, ...
  • cloud-controller-manager: cloud-specific glue

Data plane (does)

  • kubelet: runs pods on the node, reports status
  • container runtime (containerd, CRI-O): via CRI, pulls images and runs containers
  • kube-proxy: programs Service routing (iptables / IPVS)
  • CNI dataplane (Calico, Cilium, ...): pod IP, routing, NetworkPolicy
  • CSI node plugin: mounts volumes for pods

The asymmetry is intentional. The control plane has four (sometimes five) discrete components, all of them stateless except etcd. The data plane has fewer distinct components, but one of each — kubelet, runtime, kube-proxy, CNI dataplane, CSI node plugin — runs on every worker node, at the scale of the whole cluster.

What each control plane component actually does

A short tour, in the order requests flow through them:

kube-apiserver

The only thing that talks to etcd. Everything else — controllers, scheduler, kubelet, kubectl, you, your CI pipeline — talks to the apiserver, never to etcd directly. The apiserver is:

  • A REST front-end. Each Kubernetes resource (Pod, Service, etc.) is a URL.
  • An admission and validation pipeline. Requests pass through authentication, authorization, mutating admission webhooks, schema validation, and validating admission webhooks.
  • The owner of the watch cache. Every controller and every kubelet maintains a long-lived watch connection to the apiserver to learn about changes.
  • Stateless. You can run several apiserver replicas behind a load balancer; they all point at the same etcd cluster.
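The "each resource is a URL" point is concrete enough to sketch. A toy helper (hypothetical, not part of any client library) that builds the REST path for a namespaced core-group resource:

```python
def core_resource_path(resource: str, namespace: str, name: str = "") -> str:
    """Build the REST path for a namespaced core-group (v1) resource.

    Toy helper for illustration; real clients (client-go, kubectl) do
    this mapping internally, including for non-core API groups, which
    live under /apis/<group>/<version>/... instead of /api/v1/.
    """
    path = f"/api/v1/namespaces/{namespace}/{resource}"
    return f"{path}/{name}" if name else path

# A Pod named "web" in namespace "prod" is addressed as:
print(core_resource_path("pods", "prod", "web"))
# /api/v1/namespaces/prod/pods/web
```

`kubectl get pod web -n prod` is, under the hood, a GET against exactly that path.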

Because every other component depends on the apiserver, "apiserver is down" is the canonical control plane outage scenario. Note that "apiserver is down" does not mean "the cluster stops running" — covered in Module 8.

etcd

A distributed key-value store with strong consistency, backed by the Raft consensus algorithm (Module 3). Everything Kubernetes considers "the cluster state" lives here:

  • Every Pod, Service, ConfigMap, Secret, Deployment.
  • Every CRD instance.
  • The lease objects that components use for leader election.
  • Cluster events, until they expire.

etcd is the only stateful piece in a Kubernetes control plane. Lose etcd without a backup and you lose the cluster — there is no second source of truth. Even running pods are running because etcd told the kubelet they should be there.

The apiserver is the only client that talks to etcd. This is on purpose: it lets the apiserver enforce schema, authorization, and admission rules consistently. Bypassing the apiserver to write to etcd directly is exactly how you corrupt a cluster.
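To make "source of truth" concrete: by default the apiserver stores each namespaced object under a predictable etcd key. A toy sketch of that layout (illustrative only; the prefix is configurable and cluster-scoped resources omit the namespace segment):

```python
def registry_key(resource: str, namespace: str, name: str) -> str:
    """Approximate the default etcd key the apiserver uses for a
    namespaced object. Illustrative: the /registry prefix can be
    changed via apiserver configuration."""
    return f"/registry/{resource}/{namespace}/{name}"

print(registry_key("pods", "default", "web-7d4b9"))
# /registry/pods/default/web-7d4b9
```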

kube-scheduler

A controller, but special enough to call out. The scheduler watches for Pods with no spec.nodeName set and decides which node each one should run on. Once it picks, it writes a Binding through the apiserver, which sets the Pod's spec.nodeName. That is its entire job.

Notice what the scheduler does not do: it does not start the pod, it does not pull images, it does not check whether the pod is healthy. It picks a node and walks away. The kubelet on the chosen node sees the pod via watch and takes it from there.

This crisp separation is what lets you replace the scheduler with your own (or run multiple schedulers). It is also what makes "the scheduler is broken" a much smaller problem than it sounds — existing pods keep running fine; only new pod placement stops.
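The scheduler's narrow contract fits in a few lines. This toy sketch uses plain dicts for pods and collapses all filtering and scoring into "pick the least-loaded node"; a real scheduler watches the apiserver and writes a Binding rather than mutating objects in memory:

```python
# Toy pods and nodes; names and load numbers are made up.
pods = [
    {"name": "web-1", "nodeName": None},
    {"name": "web-2", "nodeName": "node-a"},  # already scheduled
]
node_load = {"node-a": 12, "node-b": 3}

def schedule(pods, node_load):
    for pod in pods:
        if pod["nodeName"] is not None:
            continue  # the scheduler ignores pods that already have a node
        # "Filtering and scoring" collapsed to: least-loaded node wins.
        chosen = min(node_load, key=node_load.get)
        pod["nodeName"] = chosen  # stand-in for writing a Binding
        node_load[chosen] += 1
        # And that is all: no image pull, no container start, no health check.

schedule(pods, node_load)
print(pods[0]["nodeName"])  # node-b
```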

kube-controller-manager

A single binary running many built-in controllers. Each one watches a kind of resource and reconciles it toward the desired state:

  • The Deployment controller creates ReplicaSets.
  • The ReplicaSet controller creates Pods.
  • The Endpoints/EndpointSlice controller maintains the list of Pod IPs backing each Service.
  • The Node controller marks Nodes NotReady when their kubelet stops checking in.
  • The Job, CronJob, StatefulSet, DaemonSet, and Garbage Collection controllers each have their own loop.

These are the "things actually happen" controllers. When you kubectl apply a Deployment, the apiserver writes it to etcd; the Deployment controller sees it and creates a ReplicaSet; the ReplicaSet controller sees that and creates Pods; the scheduler sees those and assigns nodes; finally the kubelet on each chosen node creates the containers.
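That cascade is worth seeing as code. A toy sketch with plain dicts and synchronous calls; real controllers are independent watch loops against the apiserver, and the object naming here is invented:

```python
# Shared "cluster state" standing in for etcd-behind-the-apiserver.
store = {"deployments": [{"name": "web", "replicas": 3}],
         "replicasets": [], "pods": []}

def deployment_controller(store):
    # Ensure each Deployment has a ReplicaSet (rollout logic omitted).
    for d in store["deployments"]:
        rs_name = d["name"] + "-rs"
        if not any(rs["name"] == rs_name for rs in store["replicasets"]):
            store["replicasets"].append({"name": rs_name,
                                         "replicas": d["replicas"]})

def replicaset_controller(store):
    # Ensure each ReplicaSet has the right number of Pods.
    for rs in store["replicasets"]:
        owned = [p for p in store["pods"] if p["owner"] == rs["name"]]
        for i in range(rs["replicas"] - len(owned)):
            store["pods"].append({"name": f"{rs['name']}-{i}",
                                  "owner": rs["name"], "nodeName": None})

deployment_controller(store)   # Deployment -> ReplicaSet
replicaset_controller(store)   # ReplicaSet -> 3 pending Pods
print(len(store["pods"]))      # 3, all still unscheduled (nodeName is None)
```

From here the scheduler would assign each pending Pod a node, and the kubelets would start the containers.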

cloud-controller-manager

The cloud-specific glue. Provisions LoadBalancer Services into ALBs/NLBs/GCLBs, attaches PersistentVolumes to nodes, removes Node objects when the underlying VM is terminated. Spun out of kube-controller-manager so that cloud-specific logic does not live in core Kubernetes.

If you run on bare metal or you self-host, you may not have this component. If you run on EKS/GKE/AKS, the cloud manages it for you.

What each data plane component actually does

The per-node side. One of each (or more, in the case of CNI) on every worker node.

kubelet

The single agent on each node that talks to the apiserver. Its loop:

  1. Watch for Pods assigned to this node (via spec.nodeName).
  2. For each Pod that should be here but is not running, ask the container runtime (via CRI) to create the containers.
  3. For each Pod that is running but should not be, ask the runtime to stop it.
  4. Report the status of every Pod (Pending, Running, Succeeded, Failed) and the Node itself back to the apiserver.

The kubelet is not in the data path of any user traffic. It is a control loop that ensures the node's actual state matches the desired state expressed in etcd (via the apiserver).
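The loop above reduces to a set comparison. A toy sketch, with a fake runtime standing in for the CRI:

```python
def kubelet_sync(desired, actual, runtime):
    """Toy version of the kubelet's reconcile step.

    desired: pod names assigned to this node (from the apiserver watch)
    actual:  pod names the runtime reports as running
    runtime: anything with start(name) / stop(name) — a stand-in for CRI
    """
    for name in desired - actual:
        runtime.start(name)   # should be here but is not running
    for name in actual - desired:
        runtime.stop(name)    # running but should not be

class FakeRuntime:
    def __init__(self):
        self.calls = []
    def start(self, name):
        self.calls.append(("start", name))
    def stop(self, name):
        self.calls.append(("stop", name))

rt = FakeRuntime()
kubelet_sync(desired={"web-1", "web-2"}, actual={"web-2", "old-1"}, runtime=rt)
print(sorted(rt.calls))  # [('start', 'web-1'), ('stop', 'old-1')]
```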

Container runtime (containerd, CRI-O)

The thing that actually runs the containers. The kubelet talks to it via the Container Runtime Interface (CRI), a gRPC API. The runtime:

  • Pulls images from registries.
  • Sets up cgroups, namespaces, mounts.
  • Calls runc (or another OCI runtime) to start the container process.
  • Streams logs back to the kubelet.

Module 6 goes deep on the runtime layers. For now: containerd is the runtime; runc is one layer below; the OCI spec is what they all agree on.
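The layering can be sketched as a delegation chain in which each layer only knows the one directly below it (the function names here are invented, not real APIs):

```python
# Toy delegation chain: kubelet -> CRI -> containerd -> runc.
def runc_run(bundle):
    # OCI runtime: given a prepared bundle, start the container process.
    return f"process for {bundle}"

def containerd_create(image):
    # Runtime: pull the image, unpack a rootfs, set up cgroups/namespaces,
    # then hand a bundle to the OCI runtime. All collapsed to a string here.
    bundle = f"bundle({image})"
    return runc_run(bundle)

def kubelet_cri_call(image):
    # The kubelet speaks CRI (gRPC in reality) and knows nothing below it.
    return containerd_create(image)

print(kubelet_cri_call("nginx:1.27"))  # process for bundle(nginx:1.27)
```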

kube-proxy

Watches Services and EndpointSlices and programs the node's iptables, IPVS, or nftables rules so that traffic to a Service ClusterIP is rewritten to one of the backing Pod IPs. kube-proxy is not in the packet path; the kernel does the rewriting based on the rules kube-proxy has installed.

Crucial implication: kube-proxy can crash and traffic keeps flowing because the rules are still in the kernel. New endpoints will not be reflected until kube-proxy comes back, but existing connections are unaffected.
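A toy model of what those rules amount to: a map from Service ClusterIP and port to the current set of backend pod IPs, with the per-connection pick done by the "kernel" (here, a function). All addresses are made up:

```python
import random

# Stand-in for the rules kube-proxy has programmed into the kernel.
rules = {("10.96.0.10", 80): ["10.244.1.5:8080", "10.244.2.7:8080"]}

def dnat(dst_ip, dst_port):
    """Simulate the kernel's rewrite of a Service ClusterIP destination."""
    backends = rules.get((dst_ip, dst_port))
    if not backends:
        return None  # no Service rule: packet passes through untouched
    return random.choice(backends)  # iptables picks a backend probabilistically

backend = dnat("10.96.0.10", 80)
print(backend in rules[("10.96.0.10", 80)])  # True
```

Note that this table outlives the kube-proxy process: if kube-proxy dies, the rules (and existing traffic) are untouched, but the table goes stale as endpoints change.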

CNI dataplane (Calico, Cilium, Flannel, ...)

Two parts: the CNI binary that runs at pod-create time to set up the pod's network namespace and IP, and the dataplane that handles routing and NetworkPolicy enforcement. The dataplane lives in the kernel (iptables / nftables / eBPF) or in a per-node agent.

The CNI dataplane is in the packet path. If it breaks, pods cannot reach each other or the network at all. This is one of the few "data plane is broken" scenarios that genuinely takes user traffic down.

CSI node plugin

Mounts persistent volumes into pods at pod-create time, unmounts at pod-stop. The CSI controller plugin (which provisions volumes) runs as a Deployment in the control plane namespace; the CSI node plugin runs as a DaemonSet on every node. They talk to the same cloud API but do different parts of the volume lifecycle.

Why the split matters operationally

The split is not academic. It directly answers questions you will be asked at 3 AM:

"The apiserver is unreachable. Is everything down?"

No. The control plane is unreachable. The data plane keeps doing what it was already doing. Existing pods keep running. kube-proxy keeps routing existing Services. CNI keeps forwarding packets. The only things broken are the ones that require the control plane: new pods cannot be scheduled, controllers cannot reconcile, kubectl does not work.

This is the fundamental survivability property. Kubernetes degrades; it does not fall off a cliff.

"etcd is corrupted. What is recoverable?"

Without etcd, the apiserver cannot serve reads or writes. New control plane operations stop. But: the kubelets still know what they should be running (they cached it), and the data plane components still do their job. You have time — usually hours — to restore etcd from backup before things start to drift.

What is not recoverable from the data plane alone: any cluster state not yet observed by the affected component. A new Deployment created seconds before etcd died might not be running anywhere yet.

"kubelet is crashlooping on one node. What is broken?"

The control plane and the rest of the data plane are fine. Pods on that one node are not being maintained — if any crash, they will not be restarted. New pods cannot be scheduled to it (the Node will go NotReady). Other nodes are unaffected.

"kube-proxy is down on one node."

Existing iptables rules still work. New Service endpoints are not reflected. Traffic to Services already in use keeps flowing. Traffic to Services with new pods may not see those new pods until kube-proxy comes back. Not a five-alarm fire; a slow problem.

"CNI dataplane is broken on one node."

Pods on that node cannot reach the network. This is a user-impacting outage, narrow to that node. Cordon the node and drain it; the workloads reschedule elsewhere.

The pattern: control plane outages stop change; data plane outages stop existing function. These have different urgencies and different mitigation paths.

Why the split matters for design

The architectural patterns flow from this split:

  • The control plane can be highly available with cheap stateless replicas. Three apiserver replicas behind an LB plus a 3- or 5-member etcd cluster cover most clusters.
  • The data plane scales horizontally with the cluster. Adding nodes does not require adding control plane capacity proportionally — you might double from 50 to 100 nodes without touching the control plane.
  • The data plane components are designed to tolerate apiserver unavailability. Watches reconnect; cached state is used until refreshed. This is deliberate.
  • The control plane is the bottleneck for change rate, not the data plane. When you scale a Deployment from 100 to 10,000 replicas, the apiserver's QPS and the controller-manager's reconcile rate are what limit you.

The design decision behind all of this: make the data plane dumb and the control plane smart. The kubelet does almost no decision-making; it implements decisions that came from the control plane via the apiserver. The control plane does almost no execution; it just persists desired state and lets the data plane converge.

PRO TIP

A useful test for any new Kubernetes component (operator, controller, custom scheduler): which side of the split is it on? If it makes decisions about cluster state, it belongs in the control plane and should follow the control-plane patterns (watch the API, write back via the API, never touch nodes directly). If it implements decisions on a node, it belongs in the data plane and should follow the data-plane patterns (DaemonSet, kubelet-style reconcile loop, no direct etcd access). Components that try to be both end up brittle.

Where the boundaries blur

In practice, two things make the clean diagram messier:

Static pods

The kubelet supports a "static pod" mode where it watches a directory on disk for pod manifests and runs whatever it finds, with no apiserver involvement. This is how kubeadm bootstraps the control plane itself: the apiserver, controller-manager, scheduler, and etcd run as static pods on the control plane nodes, managed by kubelets that need no other Kubernetes component.
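A trimmed illustration of what such a manifest looks like, dropped into the kubelet's static-pod directory (/etc/kubernetes/manifests by default under kubeadm). A real kubeadm-generated kube-apiserver manifest has many more flags, probes, and mounts:

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml (heavily trimmed illustration)
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-apiserver
    image: registry.k8s.io/kube-apiserver:v1.30.0
    command:
    - kube-apiserver
    - --etcd-servers=https://127.0.0.1:2379
```

The kubelet sees the file appear in the directory and runs the pod; no apiserver, scheduler, or controller is consulted.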

This is a useful escape hatch but blurs the data plane / control plane boundary. The kubelet is data plane, but the static pods running the control plane are obviously control plane. In production this only matters for control-plane bootstrapping and recovery scenarios.

CSI controller plugin

The CSI controller (which provisions PersistentVolumes) is technically control-plane logic but runs as a Deployment in the cluster, not as part of kube-controller-manager. So is the cluster autoscaler, Karpenter, cert-manager, Argo CD, and many other operators.

These are control-plane-style components (they make decisions, they reconcile) that live as workloads inside the cluster they manage. They eat their own dog food. Operationally they get treated like data-plane workloads (you upgrade them, you scale them, they have node requirements), but architecturally they are control-plane reasoning.

Service mesh sidecars

Istio, Linkerd, Cilium service mesh: these inject a proxy sidecar (Envoy or similar) into application pods. The sidecar is in the application data path — clearly data plane. The control plane of the mesh (istiod, linkerd-controller) runs as a Deployment, eats its own dog food the same way.

The mesh has its own control plane and its own data plane, riding on top of Kubernetes' control plane and data plane. Layers of layers, but the same split applies recursively.

Summary

The control plane decides; the data plane does. Four (or five) components on the control plane side: apiserver, etcd, scheduler, controller-manager, sometimes cloud-controller-manager. Per-node components on the data plane side: kubelet, runtime, kube-proxy, CNI dataplane, CSI node plugin.

The split explains the survivability properties of Kubernetes. The control plane can be restarted, replaced, or temporarily down without taking the cluster's running workloads with it. The data plane runs without depending on the control plane being instantaneously available — only on the state already cached.

Internalize this split. Almost every architectural question in the rest of this course (and in interviews) starts from "where on this diagram does this component sit?"

The next lesson goes one level deeper into the apiserver's role — why everything in Kubernetes routes through it, and why that single design choice is what makes the system extensible.