The API Server as the Universal Bus
Most distributed systems have a database, a message queue, a service registry, an audit log, and an authorization layer. Each is a separate piece of infrastructure, each with its own scaling characteristics, each with its own failure mode. Kubernetes does not have these as separate things. It has one thing — the apiserver — that does all of them. Every read, every write, every watch, every audit event, every authorization decision goes through this one component.
That sounds like a single point of failure. It is not, exactly — and understanding why is the second piece of the architectural mental model.
The apiserver is a database front-end, a message bus, a service registry, an audit log, and an authorization gateway, all rolled into one. Every other Kubernetes component talks only to the apiserver, never to anything else. This is not a quirk; it is the design choice that makes the platform extensible, auditable, and survivable.
What "everything goes through the apiserver" actually means
A walk through who talks to what:

    kubectl ─────────────┐
    kubelet ─────────────┤
    scheduler ───────────┼──▶ apiserver ──▶ etcd
    controller-manager ──┤
    operators ───────────┘

Every arrow on the left of that diagram is the same kind of arrow: a TLS-secured HTTPS connection to the apiserver (the apiserver itself reaches etcd over gRPC). There is no other arrow. The kubelet does not talk to etcd. The scheduler does not talk to the kubelet. The Deployment controller does not talk to the ReplicaSet controller. They all talk to the apiserver, and the apiserver coordinates.
This sounds inefficient (why does the scheduler not just tell the kubelet directly?), and it is in fact slightly less efficient than direct messaging would be. That cost buys several payoffs, starting with the five jobs below.
The five jobs the apiserver does at once
1. Authentication and authorization
Every request authenticates first. The apiserver supports OIDC, client certs, service account tokens, webhook auth, and anonymous (off by default). After auth, every request is authorized — RBAC, ABAC, or webhook authorizer.
This is centralized for a reason. If kubelets talked directly to etcd, etcd would need its own authentication layer; if the scheduler talked directly to kubelets, the kubelets would need their own. Centralizing auth at the apiserver means one place to define who can do what, and that place is the source of truth for cluster operations.
This is also why bypassing the apiserver to write to etcd directly is so dangerous: you skip the entire auth layer.
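The request path can be pictured as a short pipeline where each stage can short-circuit before the next one runs. The sketch below is a toy model, not the real apiserver filter chain; the token names and the RBAC-style rule are invented for illustration.

```go
package main

import "fmt"

// authenticate maps a bearer token to a user, or fails.
// Toy stand-in for the apiserver's authenticator stack (certs, OIDC, SA tokens).
func authenticate(token string) (user string, ok bool) {
	if token == "" {
		return "", false // anonymous is off by default
	}
	return "user-for-" + token, true
}

// authorize applies a toy RBAC-style rule: everyone may read, only the
// admin identity may write. Real RBAC evaluates role bindings instead.
func authorize(user, verb string) bool {
	if verb == "get" || verb == "list" || verb == "watch" {
		return true
	}
	return user == "user-for-admin-token"
}

// handle runs a request through the chain and returns an HTTP status code.
func handle(token, verb string) int {
	user, ok := authenticate(token)
	if !ok {
		return 401 // rejected before authorization is even consulted
	}
	if !authorize(user, verb) {
		return 403
	}
	return 200 // would continue on to admission and storage
}

func main() {
	fmt.Println(handle("", "get"))               // 401: unauthenticated
	fmt.Println(handle("dev-token", "list"))     // 200: reads allowed
	fmt.Println(handle("dev-token", "create"))   // 403: authenticated, not authorized
	fmt.Println(handle("admin-token", "create")) // 200
}
```

The ordering is the point: authorization never sees a request that failed authentication, and (in the real apiserver) admission never sees a request that failed either.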
2. Validation and admission
Before a request reaches storage, the apiserver runs it through admission. Built-in admission plugins enforce things like ResourceQuota and LimitRange. Webhook admission plugins (cert-manager, Kyverno, OPA Gatekeeper) enforce custom policies.
This is the consistency story: every write to cluster state passes through the same gate. Schema validation, mutating webhooks, validating webhooks. There is no path to add objects to etcd that skips this — because etcd does not have its own admission. The apiserver is the only writer.
Module 2 covers admission in depth.
3. Storage abstraction
The apiserver maps REST endpoints to etcd keys. /api/v1/namespaces/default/pods/checkout-api-xyz becomes a key /registry/pods/default/checkout-api-xyz in etcd. The apiserver handles serialization (protobuf or JSON), versioning (v1 vs v1beta1 conversion), and the watch cache.
This abstraction is what lets Kubernetes change its storage implementation without changing every client. In theory you could run Kubernetes on something other than etcd (k3s ships with sqlite as an option). The clients would not need to change because they talk to the apiserver, not to the storage.
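The mapping for namespaced core-API resources can be sketched in a few lines. This covers only the simplest case; the real registry layout has more (cluster-scoped resources, API groups, resources whose etcd prefix differs from their REST name).

```go
package main

import (
	"fmt"
	"strings"
)

// etcdKey sketches how the apiserver maps a namespaced core/v1 REST path
// to an etcd key under /registry.
func etcdKey(restPath string) (string, error) {
	// e.g. /api/v1/namespaces/default/pods/checkout-api-xyz
	parts := strings.Split(strings.TrimPrefix(restPath, "/"), "/")
	if len(parts) != 6 || parts[0] != "api" || parts[1] != "v1" || parts[2] != "namespaces" {
		return "", fmt.Errorf("not a namespaced core/v1 resource path: %s", restPath)
	}
	namespace, resource, name := parts[3], parts[4], parts[5]
	return "/registry/" + resource + "/" + namespace + "/" + name, nil
}

func main() {
	key, _ := etcdKey("/api/v1/namespaces/default/pods/checkout-api-xyz")
	fmt.Println(key) // /registry/pods/default/checkout-api-xyz
}
```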
4. Watch streams (the bus part)
The most underappreciated apiserver feature: the watch endpoint. Any client can open a long-lived HTTPS connection and receive a stream of events as resources change. This is how:
- The scheduler learns about new unscheduled Pods.
- The kubelet learns about Pods assigned to its node.
- The Deployment controller learns about Pod status changes.
- An operator learns about CRD instance changes.
- Argo CD learns about drift from the desired state.
The watch stream is the message bus. There is no Kafka, no RabbitMQ, no Redis pub/sub in core Kubernetes. The apiserver's watch implementation is the eventing layer.
What this means in practice:
- Watch reconnects are cheap and expected. Clients use a resourceVersion cursor to resume from where they left off.
- The apiserver maintains a watch cache so multiple watchers do not all hit etcd. A 10,000-node cluster does not have 10,000 etcd watch connections — it has one or two etcd watches feeding the apiserver's cache, which fans out to thousands of client watches.
- Watch is push, not poll. Controllers do not poll the apiserver every second; they hold a watch open and react when events arrive.
This is why "the apiserver is the universal bus." Every event of interest in the cluster flows through it.
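The fan-out shape of the watch cache is easy to model with channels: one upstream feed (standing in for the apiserver's single etcd watch) broadcast to many downstream watchers, so N clients never mean N etcd connections. This is a toy model of the shape, not the apiserver's implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a toy watch event: what changed, and at which resourceVersion.
type Event struct {
	ResourceVersion int
	Object          string
}

// WatchCache fans one upstream event stream out to many watchers.
type WatchCache struct {
	mu       sync.Mutex
	watchers []chan Event
}

// Watch registers a new downstream watcher and returns its channel.
func (c *WatchCache) Watch() <-chan Event {
	ch := make(chan Event, 16)
	c.mu.Lock()
	c.watchers = append(c.watchers, ch)
	c.mu.Unlock()
	return ch
}

// Broadcast delivers one upstream event to every registered watcher.
func (c *WatchCache) Broadcast(e Event) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, ch := range c.watchers {
		ch <- e
	}
}

func main() {
	cache := &WatchCache{}
	scheduler := cache.Watch() // e.g. the scheduler's watch on Pods
	kubelet := cache.Watch()   // e.g. a kubelet's watch on its node's Pods

	// One event from the single upstream watch reaches every client.
	cache.Broadcast(Event{ResourceVersion: 42, Object: "pod/checkout-api-xyz"})
	fmt.Println((<-scheduler).Object)        // pod/checkout-api-xyz
	fmt.Println((<-kubelet).ResourceVersion) // 42
}
```

Note that each event carries a resourceVersion: that is the cursor a reconnecting client hands back to resume the stream.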
5. Audit log
Every request to the apiserver can be logged with full request and response bodies (Module 7 of Production Kubernetes Operations covers this). This is the cluster's authoritative log of who did what.
Centralization matters here too. If components could bypass the apiserver, they could bypass the audit log. Because they cannot, the audit log is complete.
The implications
This single-bus design has consequences worth internalizing.
Extensibility is the killer feature
You can add a new resource type — a CRD — and the apiserver becomes the API for it. You can add a new admission webhook — a policy. You can add a new aggregated API server — for example, the metrics API. You can add a new controller — your operator.
In every case, the new piece talks to the apiserver, the apiserver fits it into the existing auth/admission/storage/watch flow, and you get authentication, authorization, audit logging, and event streams for free. You do not implement those layers yourself; the apiserver provides them.
This is why Kubernetes ate the world. Not because the orchestrator is uniquely good (it has flaws), but because the platform is uniquely extensible. Every third-party tool plugs into the same bus.
Observability is built in
The apiserver knows everything. Every action is auditable. Every state change is observable. Every authorization decision is logged. Compare to a system where you have separate database, queue, and registry components: tracing a single user's actions across them is a research project. In Kubernetes it is one log query.
Scaling is one-dimensional
Because every component routes through the apiserver, scaling Kubernetes is mostly about scaling the apiserver. There are sub-problems (etcd throughput, scheduler throughput, controller-manager throughput) but the apiserver is the central one.
This is also why apiserver performance work matters disproportionately. A 30% improvement in apiserver QPS unlocks a roughly 30% larger cluster. A 10% improvement in scheduler latency only matters for scheduler-bound workloads, and most clusters are not scheduler-bound.
The single point that is not failure
"The apiserver is a single point of failure" is wrong, but only just. The apiserver is stateless — all state lives in etcd. You can run three apiserver replicas behind a load balancer. Lose any one (or two), the others keep serving. The actual single point is etcd, and etcd is itself a 3- or 5-member cluster with quorum semantics.
So: there is no single component whose failure takes the control plane offline. There is one logical bus (apiserver-on-top-of-etcd), but the bus is itself a distributed system.
What is true is that all control-plane operations depend on this bus being healthy. If the apiserver is unreachable from a controller, that controller stops reconciling. If etcd is in a write-stall state, every write to the cluster blocks. The single bus design makes the bus the most important thing to keep working.
How clients should talk to the apiserver
A few patterns that follow from the bus design:
Use watch, not poll
If you are building a controller, use a watch not a periodic list. Watches are how every well-built component talks to the apiserver. The standard pattern:
// The standard informer pattern from client-go
// (factory is a SharedInformerFactory built from a clientset)
informer := factory.Core().V1().Pods().Informer()
informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
	AddFunc:    func(obj interface{}) { /* enqueue for reconcile */ },
	UpdateFunc: func(oldObj, newObj interface{}) { /* enqueue */ },
	DeleteFunc: func(obj interface{}) { /* enqueue */ },
})
factory.Start(stopCh) // opens the watch and keeps the local cache in sync
The informer maintains a local cache populated from the watch. Your controller queries the cache, not the apiserver. Reconcile loops are fast because they read from memory; updates flow in from the apiserver via the watch.
A controller that polls instead of watches generates load proportional to the number of objects, not to the rate of change. At scale this becomes the bottleneck.
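A back-of-the-envelope calculation makes the difference concrete. The object count, poll interval, and change rate below are illustrative, not measured:

```go
package main

import "fmt"

func main() {
	objects := 50_000     // tracked objects in the cluster (illustrative)
	pollEverySec := 30.0  // a naive controller's poll interval
	changesPerSec := 10.0 // actual rate of change (illustrative)

	// Polling: every interval re-reads every object, changed or not.
	polledObjectsPerSec := float64(objects) / pollEverySec

	// Watching: the apiserver pushes only the objects that changed.
	watchedObjectsPerSec := changesPerSec

	fmt.Printf("poll:  %.0f objects/sec through the apiserver\n", polledObjectsPerSec)
	fmt.Printf("watch: %.0f objects/sec through the apiserver\n", watchedObjectsPerSec)
}
```

Polling moves more than a hundred times the data here, and the gap widens as the cluster grows: the poll cost scales with object count while the watch cost stays pinned to the change rate.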
Use server-side apply for declarative writes
Server-side apply, introduced as alpha in 1.14 and GA since 1.22, is the modern way to send desired state to the apiserver. Each field has an owner; concurrent applies from different controllers do not conflict. The pattern is "send the spec you want; the apiserver merges with respect to ownership."
This sounds like a small thing but is structurally important: it makes the apiserver the merge point for cluster state, not the client. Multiple controllers can manage different parts of the same object cleanly.
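The ownership semantics can be modeled in miniature. This is a deliberately simplified sketch — real server-side apply tracks ownership per field path in managedFields via structured merge-diff, and raises conflicts when one manager touches another's field — but it shows why two controllers applying different fields do not clobber each other:

```go
package main

import "fmt"

// Object is a toy flat object: field name -> value, plus field -> owning manager.
type Object struct {
	Fields map[string]string
	Owner  map[string]string
}

// Apply merges the manager's desired fields into the object, taking ownership
// of exactly the fields it sends and leaving other managers' fields alone.
// This sketch models only the happy path, not conflict detection.
func (o *Object) Apply(manager string, desired map[string]string) {
	for k, v := range desired {
		o.Fields[k] = v
		o.Owner[k] = manager
	}
}

func main() {
	obj := &Object{Fields: map[string]string{}, Owner: map[string]string{}}

	// A deployment controller sets replicas, a policy controller sets a label,
	// an autoscaler later adjusts replicas. No apply disturbs unrelated fields.
	obj.Apply("deploy-controller", map[string]string{"replicas": "3"})
	obj.Apply("policy-controller", map[string]string{"labels.team": "payments"})
	obj.Apply("autoscaler", map[string]string{"replicas": "5"})

	fmt.Println(obj.Fields["replicas"], obj.Owner["replicas"])       // 5 autoscaler
	fmt.Println(obj.Fields["labels.team"], obj.Owner["labels.team"]) // payments policy-controller
}
```

The merge logic lives on the server, so every client sends the same thing: its own desired fields, nothing more.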
Respect the watch cache
The apiserver caches list and watch results. The cache is the reason a 1000-node cluster does not melt etcd: every kubelet does not get its own etcd watch.
But the cache can lag slightly behind etcd. A list with ?resourceVersion=0 says "serve from the cache, any version is acceptable"; a list with no resourceVersion parameter at all forces a quorum read from etcd. Most reads are happy with the cache; reach for a quorum read only when you have a specific consistency requirement.
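In URL terms the two read modes differ only in the query string. The host and namespace below are illustrative, and in practice a client-go controller sets ListOptions.ResourceVersion rather than building URLs by hand:

```go
package main

import (
	"fmt"
	"net/url"
)

// listURL builds a pod list request against the apiserver.
// resourceVersion "0"  => the watch cache may serve it (possibly slightly stale);
// resourceVersion ""   => omit the parameter, forcing a quorum read from etcd.
func listURL(host, namespace, resourceVersion string) string {
	u := url.URL{
		Scheme: "https",
		Host:   host,
		Path:   "/api/v1/namespaces/" + namespace + "/pods",
	}
	if resourceVersion != "" {
		q := u.Query()
		q.Set("resourceVersion", resourceVersion)
		u.RawQuery = q.Encode()
	}
	return u.String()
}

func main() {
	fmt.Println(listURL("apiserver:6443", "default", "0"))
	// https://apiserver:6443/api/v1/namespaces/default/pods?resourceVersion=0
	fmt.Println(listURL("apiserver:6443", "default", ""))
	// https://apiserver:6443/api/v1/namespaces/default/pods
}
```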
Do not bypass the apiserver
The most important rule. Even if you are running on the same node as etcd, even if your operator has a "good reason," you do not write to etcd directly. You write through the apiserver. Bypassing skips auth, admission, validation, audit, and the watch cache invalidation. Things break in subtle ways and you become the team's most-cursed engineer.
A team's operator wrote a migration tool that read CRD instances directly from etcd because "the apiserver was too slow." Performance was great in dev. In production, the migration tool ran during an apiserver upgrade — at the moment the schema was being migrated. The tool read pre-migration objects, wrote post-migration semantics back, and corrupted half the CRD instances in the cluster. Restore from etcd backup took eight hours. The fix in the runbook: never bypass the apiserver, no matter how clever the reason. The performance "improvement" cost more than it ever saved.
When the universal bus becomes a bottleneck
Mostly the design wins. Sometimes it bites. The cases worth knowing:
High-cardinality CRDs
A controller managing 100,000 CRD instances generates real apiserver load — every reconcile is a list/watch, every write is an update. This is solvable (informer cache, server-side apply, pagination) but it is the controller author's job to do.
Watch storms
A change to a high-fan-out resource (a Service used by 5,000 pods, an Endpoints object with 5,000 backends) triggers watch updates to every component that watches that kind. Done badly, the apiserver is hammered serializing the same object 5,000 times.
EndpointSlices were introduced to fix exactly this. Splitting Endpoints into smaller slices means a change to one slice does not trigger updates to every consumer of unrelated slices.
Admission webhook storms
If your webhook is on the path of every Pod create and the webhook is slow, every Pod create is slow — and pod creates are the most common write in Kubernetes. Misconfigured webhook = cluster-wide latency spike.
The apiserver protects against this with timeouts and the failurePolicy setting. But the asymmetry is real: a single bad webhook impacts the whole cluster.
Summary
The apiserver is the universal bus. Authentication, authorization, validation, admission, storage, watch, audit — all of these live in one component. Every other component talks to it; nothing else talks to anything else.
The payoffs are extensibility (CRDs and operators get the whole platform), observability (audit log is complete), and survivability (the bus is replicated, not single-instance).
The design implication: build controllers that use watch, not poll. Use server-side apply for writes. Respect the cache. Never bypass the apiserver. The bus design only works if everyone uses the bus.
The next lesson is the third leg of the architectural mental model: reconciliation loops. The pattern that lets controllers behave well even when the bus is intermittent, when other controllers are running concurrently, and when the cluster's state drifts from the desired spec.