
The Kubernetes Upgrade Preflight Checklist

Every Kubernetes upgrade I've watched fail in production failed for a reason that was visible an hour earlier. Here's the checklist.

By Sharon Sahadevan · 8 min read

Every failed Kubernetes upgrade I've watched in production failed for a reason that was visible an hour before anyone touched a control plane. Deprecated APIs still in use. Pod disruption budgets that block drains. An admission webhook with failurePolicy: Fail whose pod has been crashlooping for a week and nobody noticed. The version skew between kubelet and apiserver that kubeadm warned about and the operator skipped past.

This post is the preflight checklist I run before every minor-version upgrade. It catches roughly 90% of the issues that would otherwise show up mid-upgrade with the control plane half-rotated and a production cluster in a weird state.

Step 1: find every deprecated API still in use

This is the #1 cause of upgrade rollbacks I've seen. A chart you installed two years ago still has apiVersion: extensions/v1beta1 somewhere, and the upgrade you're about to do removes it. The Deployment renders fine on the old version, fails to admit on the new one, and now nothing in that namespace can reconcile.

You have to check three places: live objects in etcd, stored manifests (Git/Helm), and rendered output of any chart you don't directly manage.

Live objects. Use kubectl-deprecations, pluto, or the API server's own apiserver_requested_deprecated_apis metric:

# Pluto across the live cluster
pluto detect-all-in-cluster --target-versions k8s=v1.31.0

# Or query Prometheus if you scrape the apiserver
sum by (group, version, resource) (
  apiserver_requested_deprecated_apis{removed_release="1.31"}
)

The Prometheus query is the one I trust most. The apiserver itself sets this metric whenever a client (kubectl, an operator, a CI job) sends a request using a deprecated API. If it's non-zero at any point in the 30 days before your upgrade, someone or something is still using that API. The metric won't tell you who, though; for that, check the audit log, where the apiserver annotates every deprecated-API request. Find the client and fix it, don't just upgrade and hope.
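
A sketch of that audit-log query, assuming JSON-lines audit logs at a path your audit policy controls:

# Every audit event carrying the k8s.io/deprecated annotation is a
# deprecated-API request; group by user agent to find the caller.
jq -r 'select(.annotations["k8s.io/deprecated"] == "true")
  | [.userAgent, .verb, .requestURI] | @tsv' /var/log/kubernetes/audit.log \
  | sort | uniq -c | sort -rn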

Stored manifests. Run pluto detect-files -d ./ against the directory containing your Helm values and rendered manifests. Then run it against the output of helm template for every chart, because chart authors sometimes use deprecated APIs in templates that you don't see until rendering.
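
Something like this per chart, with the release name, chart path, and values file as placeholders:

# Render the chart exactly as it would be installed, then scan the output
helm template my-release ./charts/my-chart -f values.yaml \
  | pluto detect - --target-versions k8s=v1.31.0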

Operators and controllers. This is the one people forget. List every container image in the cluster and check that each operator's version supports the Kubernetes version you're upgrading to:

kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'

An old cert-manager or prometheus-operator that still reaches for a removed API will start crashlooping minutes after the apiserver upgrade.

KEY CONCEPT

The apiserver_requested_deprecated_apis metric is the only ground truth for what's actually being used. Static manifest scanning catches what's in your repo. Live-object scanning catches what's currently stored. Only the apiserver metric catches the cron job that runs once a week with a hardcoded apiVersion.

Step 2: find the PDBs that won't let you drain

The standard kubeadm flow has you cordon and drain each node before running kubeadm upgrade node and upgrading the kubelet (managed platforms do the equivalent automatically). The drain respects PodDisruptionBudgets. If a PDB says minAvailable: 100%, its disruptionsAllowed is permanently 0, and any drain that touches its pods hangs forever.
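
For reference, the per-node drain the kubeadm docs prescribe (flags vary with your workloads):

kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data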

This is the #2 cause of stalled upgrades. The fix is to find them before you start:

# Find every PDB in the cluster
kubectl get poddisruptionbudgets -A -o json | jq -r '
  .items[] | [
    .metadata.namespace,
    .metadata.name,
    (.spec.minAvailable // "n/a"),
    (.spec.maxUnavailable // "n/a"),
    .status.disruptionsAllowed
  ] | @tsv
'

The column you care about most is disruptionsAllowed. A PDB with disruptionsAllowed: 0 will block any drain that would touch one of its pods. This is sometimes deliberate (you really do need 100% availability) and sometimes a misconfigured chart that needs adjustment.

Find every PDB with disruptionsAllowed: 0 and triage:

  • Can you scale the workload up so a disruption is allowed? (kubectl scale deployment foo --replicas=N+1)
  • Can you adjust minAvailable to allow at least one disruption? (Often minAvailable: 100% was a copy-paste from a chart's default and was never reconsidered; a loosen-and-restore sketch follows this list.)
  • Is the workload pinned to a specific node via nodeSelector or nodeAffinity? If so, no amount of replica scaling will let you drain that node.
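
If a budget has to stay strict in steady state, a pragmatic pattern is to loosen it just for the window and put it back afterwards. A minimal sketch, with hypothetical names and values:

# Record the current spec so you know exactly what to restore
kubectl get pdb payments-pdb -n payments -o yaml > payments-pdb.bak.yaml

# Loosen the budget enough to allow one disruption during the window
kubectl patch pdb payments-pdb -n payments \
  --type merge -p '{"spec":{"minAvailable":2}}'

# ...run the upgrade...

# Put the original value back (here, the strict one we started with)
kubectl patch pdb payments-pdb -n payments \
  --type merge -p '{"spec":{"minAvailable":"100%"}}'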
WAR STORY

An upgrade I ran on a 200-node cluster stalled at node #47. A single Redis StatefulSet had minAvailable: 50% with 4 replicas and aggressive anti-affinity that pinned them to 4 specific nodes. Three of those nodes were already cordoned by the upgrade, so replicas evicted earlier were stuck Pending with nowhere to reschedule, and the PDB's disruptionsAllowed sat at 0. Draining node #47, which hosted the last running replica, would violate the PDB. The cluster sat in a half-upgraded state for two hours while we figured out which workload was blocking it. Lesson: triage every disruptionsAllowed: 0 PDB before, not during.

Step 3: admission webhooks and the failurePolicy: Fail trap

Mutating and validating admission webhooks sit in the request path of the apiserver. If a webhook is configured with failurePolicy: Fail and its backing pod is unreachable, every API request that matches the webhook's rules will be rejected.

During an upgrade, your control plane briefly restarts. If a webhook with failurePolicy: Fail is, say, mutating Pods cluster-wide and its own pod hasn't come back yet, you've just blocked your own ability to schedule the kube-apiserver's static pod replacement on the next control plane node. Cluster goes from "upgrading" to "broken" in one step.

Find them:

kubectl get mutatingwebhookconfigurations -o json | jq -r '
  .items[] |
  .metadata.name as $name |
  .webhooks[] |
  select(.failurePolicy == "Fail") |
  [$name, .name, .failurePolicy, (.rules[].operations | join(","))] | @tsv
'

kubectl get validatingwebhookconfigurations -o json | jq -r '
  .items[] |
  .metadata.name as $name |
  .webhooks[] |
  select(.failurePolicy == "Fail") |
  [$name, .name, .failurePolicy, (.rules[].operations | join(","))] | @tsv
'

For each one with failurePolicy: Fail:

  • Is the backing service healthy right now? (kubectl get endpoints -n <ns> <service>)
  • Is the backing pod scheduled on a control plane node? If yes, what happens when that node is drained?
  • Does the webhook's namespaceSelector exclude kube-system? If not, you have a self-defeating loop waiting to happen.
  • Does it actually need failurePolicy: Fail, or could it tolerate Ignore during an upgrade window?

The pragmatic move during an upgrade: temporarily flip non-critical webhooks to failurePolicy: Ignore, do the upgrade, flip back. cert-manager, OPA Gatekeeper, Kyverno, and similar admission webhooks all tolerate this, and skipping it has caused more "the entire cluster is read-only" incidents than I want to count.
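
A minimal sketch of the flip, with a hypothetical webhook configuration name:

# Record the original first so you can restore it after the window
kubectl get validatingwebhookconfiguration my-policy-webhook -o yaml \
  > my-policy-webhook.bak.yaml

# Flip the first webhook in the configuration to Ignore (repeat per index)
kubectl patch validatingwebhookconfiguration my-policy-webhook \
  --type json \
  -p '[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'

One caveat: components that manage their own webhook configuration may reconcile a manual patch away within seconds, so check whether the component exposes its own knob for this before relying on the patch mid-upgrade.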

The full step-by-step procedure for kubeadm upgrades, including the order of operations across control plane nodes, the etcd snapshot strategy, and the rollback path when something does go sideways, is in Kubernetes Cluster Upgrades with kubeadm. It's the playbook I wish I'd had the first time I upgraded a production cluster.

Step 4: version skew rules nobody reads

Kubernetes has documented version skew policies that the upgrade tooling enforces, but the rules are easy to forget if you're upgrading once a year:

  • kube-apiserver: upgraded first, always; no other component may run a newer version than it, and in an HA cluster the apiserver instances themselves must stay within 1 minor version of each other.
  • kubelet: at most 3 minor versions older than the apiserver (was 2 in older versions; verify against the docs for your target).
  • kube-controller-manager, kube-scheduler, cloud-controller-manager: at most 1 minor version older than the apiserver they communicate with.
  • kubectl: within 1 minor of the apiserver.

The trap: if you're trying to cover two minor versions (say, 1.29 → 1.31) in a single window, you might violate the skew rule for kubelets that haven't been upgraded yet, and the control plane itself must still step through each minor in order. The standard rule is to upgrade one minor version at a time, full stop.

Run this before you start to baseline your skew:

echo "Apiserver version:"
kubectl version -o json | jq -r '.serverVersion.gitVersion'

echo "Kubelet versions across nodes:"
kubectl get nodes -o jsonpath='{range .items[*]}{.status.nodeInfo.kubeletVersion}{"\n"}{end}' | sort | uniq -c

If your kubelets span more than two minor versions, fix that before the apiserver upgrade: the oldest ones will fall out of the skew window the moment the apiserver moves.

Step 5: the things I always do anyway

A few items that aren't strictly preflight but should happen in the same window:

  • Snapshot etcd. If you're running a self-managed control plane, take a fresh etcd snapshot and verify you can restore it on a separate machine (a sketch follows this list). A snapshot you can't restore is not a snapshot.
  • Verify your cluster autoscaler will not panic. During the drain phase, autoscalers sometimes interpret cordoned nodes as "underutilized" and try to scale them down, racing the upgrade. Pause autoscaling for the upgrade window.
  • Check storage CSI drivers. A CSI driver that doesn't support the new Kubernetes version will fail to attach new volumes. Most managed CSI drivers ship version compatibility matrices; read the one for your provider.
  • Run a canary upgrade in staging that mirrors prod's workload mix. Not just an empty test cluster: one with the same operators, the same webhooks, the same PDBs. The only failure modes that matter are the ones that show up under real configuration.
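
The etcd sketch, assuming etcdctl and etcdutl on a control plane node and the standard kubeadm certificate paths (adjust endpoints and paths for your setup):

# Take the snapshot against the local etcd member
ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-pre-upgrade.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Sanity-check it; the real test is restoring it on a separate machine
etcdutl snapshot status /var/backups/etcd-pre-upgrade.db -w table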

A short final word

Kubernetes upgrades are boring when you do them well and catastrophic when you don't. The difference between the two is almost entirely in the hour of preflight checks before you start. Run the checklist. Find the things. Fix them in advance. Then the upgrade itself is just kubeadm upgrade apply and a coffee.

The rest of the operational playbook, Day 2 ops across self-managed, EKS, GKE, and AKS, including upgrade orchestration, identity, storage, networking, and DR, is what we cover in Production Kubernetes Operations.

If you want this kind of preflight rigor in your inbox weekly, Kubenatives is where I publish operational notes for ~3,500 production Kubernetes engineers. One issue per week, no fluff, the same level of specificity as this checklist.