Kubernetes Architecture & Chaos

Reconciliation Loops Everywhere

If the control plane / data plane split is the geography of Kubernetes and the apiserver is its central nervous system, then the reconciliation loop is the universal motor. Every controller in Kubernetes — built-in or custom — works the same way: observe current state, compare to desired state, take an action that moves current toward desired, repeat.

This sounds boring. It is the boring discipline that makes the system work despite partial failures, race conditions, and operator errors. The whole reason Kubernetes feels resilient — pods reschedule, services repoint, deployments converge — is that hundreds of these reconciliation loops are running concurrently, each idempotent, each level-triggered, each ready to retry on the next tick.

KEY CONCEPT

Reconciliation is level-triggered (act on the current state, not on the event that changed it), idempotent (running it twice is the same as running it once), and convergent (the system tends toward the desired state over time). These three properties together are why Kubernetes can lose events, restart controllers, and recover from arbitrary failures without manual intervention.

The pattern in 10 lines

Every controller in Kubernetes reduces to this loop:

for {
    // Observe
    desired := getDesiredState()  // from apiserver: spec
    actual := getActualState()    // from apiserver or world: status

    // Compare and act
    if actual != desired {
        action := compute(actual, desired)
        apply(action)             // calls back to apiserver
    }

    // Wait for next change
    waitForEventOrTimer()
}

That is it. The Deployment controller, the ReplicaSet controller, the Endpoint controller, your custom operator — all of them are this loop with different getDesiredState and compute functions.

A few details that make it real in practice:

  • "Wait for next change" is usually a watch on the apiserver (Module 1.2 covered this).
  • "Compare" usually walks both sides field by field. Server-side apply makes this much cleaner.
  • "Act" almost always means "send a write to the apiserver." Controllers do not directly start pods or call cloud APIs except in narrow cases (the cloud-controller-manager, CSI controller, etc.).

Level-triggered vs edge-triggered: the foundational choice

The most important word in the previous section is level-triggered.

In an edge-triggered system, you react to events: "a Pod was created," "a Pod was deleted." If you miss an event — because your controller crashed, because a network blip dropped the message, because the system is slower than the events — you are out of sync, and there is no automatic way to recover.

In a level-triggered system, you react to state: "what should be true right now? what is true right now? close the gap." Missing an event does not matter, because the next time you observe, you see the current state and act on it. The system is what it is; you do not need to remember what it was.

Edge-triggered vs level-triggered

  • Edge-triggered: react to the event ("pod was deleted"). If the event is lost, the action is skipped, and recovering from missed events is hard. Examples: hardware interrupts, most messaging systems. Fragile under partial failure.
  • Level-triggered: react to the state ("should be 5, have 4"). If an event is lost, the next observation catches it, and the system converges from any state. Examples: Kubernetes, Terraform, routing protocols, GitOps. Resilient by construction.

Watches in Kubernetes look edge-triggered (they deliver events), but the controllers built on top of them are level-triggered. The pattern: a watch event triggers a reconcile, but the reconcile reads the current state and computes an action from there — not from the contents of the event. The event is a hint that "something changed, look again," not a description of what to do.
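A minimal sketch of the "event is a hint" pattern in plain Go (no Kubernetes libraries; `store`, `actual`, and all names are illustrative stand-ins): the watch handler enqueues only a key, and the reconcile reads the current state for that key rather than trusting anything in the event.

```go
package main

import "fmt"

// store stands in for the apiserver: key -> desired replica count.
// actual stands in for the world: key -> running replicas.
var store = map[string]int{}
var actual = map[string]int{}

// onEvent is what a watch handler should do: enqueue the key only.
// The event's payload is deliberately discarded — it is just a hint.
func onEvent(queue chan string, key string) {
	queue <- key
}

// reconcile is level-triggered: it reads the CURRENT state for the key
// and closes the gap, no matter which event (or how many) woke it up.
func reconcile(key string) {
	desired, ok := store[key]
	if !ok {
		delete(actual, key) // object gone: clean up
		return
	}
	actual[key] = desired // close the gap
}

func main() {
	queue := make(chan string, 10)

	store["web"] = 5
	onEvent(queue, "web") // event fires while desired is 5
	store["web"] = 3      // spec changes again before we reconcile

	// The reconcile acts on the state as it is NOW, not as the
	// event described it — so it lands on 3, not the stale 5.
	reconcile(<-queue)
	fmt.Println(actual["web"]) // 3
}
```

Because the queue holds only keys, duplicate events for the same object can be collapsed into one entry with no loss of correctness — which is exactly what controller-runtime's work queue does.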

This sounds pedantic, but it is one of the deepest design choices in the system. It is what lets a controller restart and resume work without losing state. It is what lets multiple controllers act on the same resource without a coordination layer. It is what makes Kubernetes a system that converges rather than one that executes.

Idempotence: the property that lets retries be safe

Reconciliation must be idempotent. Running it twice produces the same result as running it once.

Why this matters: a controller can crash mid-reconcile, restart, and re-run the same reconcile from scratch. If the reconcile is not idempotent, you get double-creates, partial state, or correctness bugs. If it is idempotent, the second run is a no-op (because the first run already converged).

The pattern:

func reconcile(name string, desired Spec) error {
    // Idempotent: check first, only act on the gap
    actual, err := get(name)
    if err == NotFound {
        return create(name, desired)   // idempotent under "first one wins"
    }
    if !equal(actual, desired) {
        return update(name, desired)   // idempotent under "result equals desired"
    }
    return nil  // already correct, no-op
}

The first call creates the resource. The second call sees it already exists and matches, returns no-op. A third call after a manual edit sees a mismatch and updates. Run it 100 times in a row and the cluster ends up the same.
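The pattern above runs end-to-end against an in-memory store. This is a sketch: `Spec`, `get`, `create`, and `update` are stand-ins for apiserver calls, not client-go APIs.

```go
package main

import (
	"errors"
	"fmt"
)

// Spec is a stand-in for a resource's desired state.
type Spec struct{ Replicas int }

var errNotFound = errors.New("not found")
var cluster = map[string]Spec{} // stand-in for the apiserver

func get(name string) (Spec, error) {
	s, ok := cluster[name]
	if !ok {
		return Spec{}, errNotFound
	}
	return s, nil
}

func create(name string, s Spec) error { cluster[name] = s; return nil }
func update(name string, s Spec) error { cluster[name] = s; return nil }

// reconcile is the idempotent check-then-act loop body from the text.
func reconcile(name string, desired Spec) error {
	actual, err := get(name)
	if errors.Is(err, errNotFound) {
		return create(name, desired) // first run: create
	}
	if err != nil {
		return err
	}
	if actual != desired {
		return update(name, desired) // drift: correct it
	}
	return nil // already correct: no-op
}

func main() {
	desired := Spec{Replicas: 3}
	for i := 0; i < 100; i++ { // run it 100 times in a row
		if err := reconcile("web", desired); err != nil {
			panic(err)
		}
	}
	cluster["web"] = Spec{Replicas: 1} // manual edit
	reconcile("web", desired)          // next run repairs the drift
	fmt.Println(cluster["web"].Replicas) // 3
}
```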

In contrast, a non-idempotent reconcile would do something like:

// DO NOT WRITE THIS
func badReconcile(name string, desired Spec) error {
    return create(name, desired)  // fails on second call with AlreadyExists
}

The bad version requires the caller to know whether it has run before. That is the kind of state controllers cannot reliably keep — the whole point of reconciliation is that the controller is stateless across runs.

Convergence: the property that makes the system robust

A reconciliation loop converges if, eventually, current state equals desired state and the loop becomes a steady stream of no-ops.

Convergence is not guaranteed by idempotence alone. Two examples of non-convergent loops:

Oscillation

Controller A wants 5 replicas; controller B wants 3. They both reconcile a Deployment. A bumps to 5, B drops to 3, A bumps to 5, etc.

The Kubernetes solution: field ownership via server-side apply. Each controller declares which fields it manages; when a second controller tries to set a field owned by the first, the apiserver rejects the write with a conflict error instead of silently overwriting (a writer can force ownership, but that is an explicit choice, not the default).
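A toy model of the field-ownership semantics (this sketches the conflict rule only, not the real apiserver's managed-fields machinery; all names are illustrative): each field records its manager, and a second manager writing that field gets a conflict instead of triggering oscillation.

```go
package main

import "fmt"

// owners maps a field path to the manager that owns it; values holds
// the field values. A toy model of server-side apply conflict detection.
var owners = map[string]string{}
var values = map[string]int{}

// applyField sets a field on behalf of a manager. If another manager
// already owns the field, it returns a conflict instead of overwriting.
func applyField(manager, field string, value int) error {
	if owner, ok := owners[field]; ok && owner != manager {
		return fmt.Errorf("conflict: %q owned by %q", field, owner)
	}
	owners[field] = manager
	values[field] = value
	return nil
}

func main() {
	// Controller A claims spec.replicas and sets it to 5.
	fmt.Println(applyField("controller-a", "spec.replicas", 5)) // <nil>
	// Controller B tries the same field: conflict, no oscillation.
	fmt.Println(applyField("controller-b", "spec.replicas", 3))
	fmt.Println(values["spec.replicas"]) // still 5
}
```

The A-wants-5, B-wants-3 loop from above dies at B's first write: B gets an error it must handle, rather than a silent overwrite that A will fight on the next tick.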

Slow convergence

A controller that creates one resource per reconcile when it should create ten. The system trickles toward desired state but takes minutes when it should take seconds.

This is usually a bug in the controller's compute step — it is reacting to events instead of state. The fix: in each reconcile, look at all the gaps and fix them, not just the one closest to the most recent event.

The Kubernetes-style fix: never assume which event triggered the reconcile. Just compute the full diff and act on all of it.
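A sketch of the difference (illustrative names, no Kubernetes libraries): instead of creating one replica per wake-up, the reconcile computes the whole gap and closes all of it in one pass.

```go
package main

import "fmt"

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// reconcileFullDiff closes the ENTIRE gap between desired and actual
// in one reconcile, instead of one replica per wake-up.
func reconcileFullDiff(desired int, actual []string) []string {
	for len(actual) < desired {
		actual = append(actual, fmt.Sprintf("pod-%d", len(actual)))
	}
	return actual[:minInt(len(actual), desired)] // also scales down
}

func main() {
	pods := []string{"pod-0", "pod-1", "pod-2"}
	pods = reconcileFullDiff(10, pods) // one reconcile, seven creates
	fmt.Println(len(pods))             // 10
	pods = reconcileFullDiff(4, pods)  // one reconcile, six removals
	fmt.Println(len(pods))             // 4
}
```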

How a Deployment converges in practice

A worked example to make the abstract concrete. You apply a Deployment with replicas: 3. What happens, controller by controller:

Reconciliation cascade for a Deployment:

  1. User: kubectl apply -f deployment.yaml (replicas: 3).
  2. The apiserver writes the Deployment object to etcd.
  3. The Deployment controller reconciles: it wants a ReplicaSet matching the template, and creates one.
  4. The ReplicaSet controller reconciles: it wants 3 pods, observes 0, and creates 3 Pod objects.
  5. The scheduler reconciles each Pod: it picks a node and patches pod.spec.nodeName.
  6. The kubelet on each chosen node creates the containers via the CRI.

Each step is its own controller doing its own reconcile. None of them know about the others; each just watches the apiserver, observes a gap, and closes it.

The same flow handles failure: kill the ReplicaSet controller halfway through, and the next time it runs it observes the gap (3 pods desired, 1 actual) and creates the missing two. A replica that crashes while the controller is down is handled the same way: once the controller is back, the gap is visible and gets closed. The system converges.

Why this pattern wins under partial failure

The reconciliation pattern's value shows up specifically when things go wrong. Three failure modes and how reconciliation handles them:

Controller crashes mid-action

A controller is creating Pod 2 of 3 when its container is killed. It restarts. It re-watches, re-observes the state (1 pod exists, 3 desired), and creates Pods 2 and 3. The crash had no lasting effect because the action was idempotent.

Compare to an event-driven system where the create-Pod-2 event was already consumed: that controller would forget about Pod 2 forever.
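This failure mode can be simulated in a few lines of plain Go (a toy: the map stands in for Pod objects in the apiserver, and crashAfter simulates the controller being killed):

```go
package main

import "fmt"

var pods = map[string]bool{} // stand-in for the apiserver's Pod objects

// reconcilePods creates whichever pods are missing. crashAfter simulates
// the controller dying after that many creates (-1 = no crash).
func reconcilePods(desired int, crashAfter int) {
	created := 0
	for i := 0; i < desired; i++ {
		name := fmt.Sprintf("pod-%d", i)
		if pods[name] {
			continue // already exists: idempotent skip
		}
		if created == crashAfter {
			return // simulate the controller being killed mid-reconcile
		}
		pods[name] = true
		created++
	}
}

func main() {
	reconcilePods(3, 1)    // crash after creating one pod
	fmt.Println(len(pods)) // 1: partial progress survives in the "apiserver"

	// The restarted controller re-observes and closes the remaining gap.
	reconcilePods(3, -1)
	fmt.Println(len(pods)) // 3: converged, no duplicates
}
```

The second run never needs to know a crash happened: the state it reads already reflects the partial progress, and the idempotent skip prevents double-creates.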

Network partition between controller and apiserver

A controller's watch connection drops. It cannot see new events. The fix: when the watch reconnects, the controller does a full re-list and re-reconciles everything. The list is the level-triggered read: the cluster's state is whatever the list says it is.

This is built into client-go's informers. A reconnect after a partition triggers a full resync. Controllers that follow the standard pattern get this for free.

Two controllers acting on the same object

Avoid where possible. When unavoidable, server-side apply with field ownership. Each controller declares the fields it manages; the apiserver rejects writes to fields owned by others.

If you do not use server-side apply, the last write wins, and you can get oscillation. Server-side apply is the right answer for any modern controller.

Anti-patterns: things that break the model

A few common patterns that look reasonable and are not:

Storing state in the controller

A controller that holds a map of "Pods I have already created" in memory. When the controller restarts, the map is empty, and the controller might create duplicates.

Right answer: do not store anything in the controller. Read the state from the apiserver every reconcile. The apiserver is the source of truth.

Reacting to events instead of state

A controller that says "I got a Pod created event, so I will create a corresponding ConfigMap." If the create-ConfigMap step fails after the create-Pod event, the controller never retries — it is waiting for another Pod event.

Right answer: each reconcile observes the full state. "There is a Pod with no matching ConfigMap" — create the ConfigMap. The trigger is the gap, not the event. The next reconcile cycle will see the gap if the first attempt failed.
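A sketch of the gap-driven version (plain Go; pods, configMaps, and the injected transient failure are all stand-ins): each reconcile scans every pod and creates any missing ConfigMap, so a failed attempt is picked up by the next cycle without anyone waiting for another event.

```go
package main

import (
	"errors"
	"fmt"
)

var pods = []string{"a", "b"}
var configMaps = map[string]bool{}
var failNext = true // inject one transient apiserver error

func createConfigMap(name string) error {
	if failNext {
		failNext = false
		return errors.New("transient apiserver error")
	}
	configMaps[name] = true
	return nil
}

// reconcile looks at ALL pods and creates any missing ConfigMap.
// It does not care which event woke it up; the gap is the trigger.
func reconcile() error {
	var firstErr error
	for _, p := range pods {
		if configMaps[p] {
			continue // gap already closed
		}
		if err := createConfigMap(p); err != nil && firstErr == nil {
			firstErr = err // report, but keep closing the other gaps
		}
	}
	return firstErr
}

func main() {
	fmt.Println(reconcile())     // first attempt hits the transient error
	fmt.Println(reconcile())     // retry sees the remaining gap, closes it
	fmt.Println(len(configMaps)) // 2
}
```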

Synchronous chains

A controller that, in one transaction, creates a Pod and updates a Service and writes a Secret. If any step fails, the rollback is the controller's responsibility.

Right answer: each reconcile makes one or a few state changes; the next reconcile sees the new state and continues. The controller does not need transactions because it can retry indefinitely.

Long-held locks

A controller that grabs a distributed lock at the start of reconcile and holds it for ten minutes. While held, no other replica of the controller can run. If the holder crashes, the lock might not release.

Right answer: leader election (with coordination.k8s.io/Lease) — exactly one replica is the active controller at any time, others are warm standbys. The active controller does not need any other lock; it is the only one running. If it dies, the standbys see the lease expire and one becomes active.

This is the pattern controller-manager uses internally. Multiple controllers running for HA, exactly one active.

Building your own controller

If you write a custom operator, the standard stack is:

  • client-go for the apiserver client.
  • controller-runtime for the controller scaffolding (informers, work queues, reconcile loops).
  • kubebuilder as the project generator that wires it all together.

The pattern that emerges is essentially:

func (r *MyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    var obj acmev1.MyResource
    if err := r.Get(ctx, req.NamespacedName, &obj); err != nil {
        if errors.IsNotFound(err) {
            return ctrl.Result{}, nil  // object gone, nothing to do
        }
        return ctrl.Result{}, err
    }

    // Compute desired state from obj.Spec
    desired := computeDesired(&obj)

    // Get actual state
    actual, err := r.getActual(ctx, &obj)
    if err != nil {
        return ctrl.Result{}, err
    }

    // Apply diff
    if err := r.applyDiff(ctx, actual, desired); err != nil {
        return ctrl.Result{Requeue: true}, err  // retry
    }

    // Update status
    return ctrl.Result{}, r.updateStatus(ctx, &obj, ...)
}

The Reconcile function is idempotent by construction. Returning an error or Requeue: true causes the framework to call Reconcile again later. There is no need for "did I already do this" state — the next call observes the current state and decides.

PRO TIP

The first time you write a controller, fight the urge to make it event-driven. The framework will deliver events for you, but your reconcile body should not care which event triggered it. Pretend every reconcile is "wake up cold, look at the state, do the right thing." Controllers written this way are short, robust, and tolerant of restarts. Controllers written event-by-event are long, fragile, and embarrass the team in the next outage.

Reconciliation outside core Kubernetes

The pattern shows up everywhere once you recognize it:

  • Argo CD reconciles a Git repository against the cluster — same pattern, different "desired state" source.
  • Terraform is reconciliation against a state file.
  • Ansible's idempotent modules are local-mode reconciliation.
  • Routing protocols (OSPF, BGP) are level-triggered convergence at network scale.
  • Spreadsheet recalculation is reconciliation over a graph of cells.

The reason all of these survive partial failure is the same reason Kubernetes does: level-triggered, idempotent, convergent reconciliation. Once you see the pattern, you see it everywhere — and you also see the systems that fail to use it (database migrations that are not idempotent, deploy scripts that assume "started from clean," queue consumers that lose messages on crash) and you understand why those systems are fragile.

Summary

Reconciliation is the universal pattern of Kubernetes control. Every controller observes desired state, compares to actual state, takes an action to close the gap, and waits for the next change. The properties that matter:

  • Level-triggered: act on current state, not on events.
  • Idempotent: running twice equals running once.
  • Convergent: the system tends toward desired state over time.

These three together make the system tolerant of crashes, restarts, partitions, and concurrent actors. Kubernetes works because of this pattern, not in spite of any single component being correct.

Module 1 is done. You have the architecture mental model: control plane decides, data plane does, apiserver is the bus, every controller is a reconciliation loop. Module 2 zooms into the apiserver itself — the request lifecycle that every kubectl command and every controller write goes through.