Kubernetes Cluster Upgrades with kubeadm

Building Your Upgrade Plan

You manage 3 clusters: dev (5 nodes), staging (20 nodes), and production (200 nodes across 3 availability zones). The CTO sends an email: "We need to be on Kubernetes 1.31 by end of quarter for compliance. Make it happen."

You have 12 weeks. You are currently on 1.29, which means two minor version hops (1.29 to 1.30, then 1.30 to 1.31). Each hop needs testing. Each cluster needs its own upgrade window. Production has 200 nodes that need to be drained and upgraded without downtime.

You cannot wing this. You need a plan: a real one, with timelines, rollback criteria, and communication checkpoints. This lesson teaches you how to build that plan.

Part 1: The Upgrade Sequence: Environment Order

Never upgrade production first. This is obvious but worth stating explicitly because under deadline pressure, people do obviously wrong things.

The Pipeline: Dev, Staging, Production

Every upgrade flows through environments in order:

Phase	Environment	Purpose	Duration
1	Dev	Catch obvious breakage, test the upgrade process itself	1-2 days
2	Staging	Validate workloads, test integrations, soak test	3-5 days
3	Production	The real thing, with maintenance window and rollback plan	1-3 days

Dev is your throwaway environment. If the upgrade fails here, nothing is lost. Use dev to practice the upgrade commands, identify any kubeadm issues, and discover missing prerequisites.

Staging is your confidence builder. It should mirror production as closely as possible (same Kubernetes version, same add-ons, same network policies, same admission webhooks). Staging catches the issues that dev misses: workload incompatibilities, webhook failures, CRD issues.

Production is where the stakes are real. By the time you reach production, you should have already done the upgrade twice (dev + staging) and know exactly what to expect.

KEY CONCEPT

The upgrade pipeline is not just about finding bugs, it is about building muscle memory. By the time you upgrade production, you should be able to do it in your sleep because you have already done it twice. Every surprise should have been encountered and resolved in dev or staging.

Multi-Hop Upgrades

If you need to jump multiple minor versions (e.g., 1.29 to 1.31), you run the entire pipeline for each hop:

Hop 1 (1.29 → 1.30):
  Dev:     Days 1-2
  Staging: Days 3-7
  Prod:    Days 8-10

Hop 2 (1.30 → 1.31):
  Dev:     Days 11-12
  Staging: Days 13-17
  Prod:    Days 18-20

Total: approximately 4 weeks for a 2-hop upgrade done properly. This is why the CTO giving you 12 weeks is actually reasonable, but only if you start now.

Part 2: The Pre-Upgrade Checklist

Before touching any cluster, complete this checklist. Print it. Check off each item.

Infrastructure Readiness

- [ ] Current cluster version confirmed across all components
      (API server, controller-manager, scheduler, kubelets, etcd)
- [ ] Target version identified (latest patch of the target minor version)
- [ ] kubeadm, kubelet, and kubectl packages available for target version
- [ ] etcd compatibility confirmed for target Kubernetes version
- [ ] Node OS packages up to date (container runtime, iptables/nftables)
- [ ] Sufficient disk space on control plane nodes for etcd backup
- [ ] Container runtime (containerd/CRI-O) compatible with target version

Workload Readiness

- [ ] Deprecated API audit complete (pluto/kubent scan — see Module 2, Lesson 1)
- [ ] All deprecated APIs migrated to supported versions
- [ ] Pod Disruption Budgets (PDBs) configured for critical workloads
- [ ] Third-party operators checked for target version compatibility
- [ ] Admission webhooks tested against target API server version
- [ ] CRDs verified compatible with target version
- [ ] Helm charts checked for deprecated template functions

Backup Readiness

- [ ] etcd snapshot taken and verified (see Module 2, Lesson 2)
- [ ] etcd snapshot stored in external storage (not on the control plane node)
- [ ] Cluster resource manifests exported (kubectl get all --all-namespaces -o yaml)
- [ ] Persistent volume snapshots taken for critical data
- [ ] Restore procedure tested in an isolated environment

PRO TIP

Store this checklist in your team wiki or runbook repo and update it after every upgrade. Each upgrade teaches you something new, add it to the checklist so the next upgrade benefits from the lesson.

Part 3: Communication and Coordination

An upgrade is not just a technical operation. It involves people.

Stakeholder Notification

Send advance notice to these groups:

Audience	When	What to Tell Them
Engineering teams	2 weeks before prod upgrade	"Kubernetes upgrade on [date]. Deploy freeze starts [date]. Expected duration: [X hours]."
SRE/On-call team	1 week before	Full runbook, rollback criteria, escalation contacts
Management	1 week before	Timeline, risk assessment, business impact if things go wrong
Security team	After completion	"Upgrade complete. Now on [version]. Compliance requirement met."

Deploy Freeze

Implement a deploy freeze starting 24-48 hours before the production upgrade:

No new deployments to production
No Helm releases
No CRD updates
No admission webhook changes

The deploy freeze exists because you need a stable baseline. If someone deploys a broken app during the upgrade, you will waste hours figuring out whether the breakage is from the upgrade or the deployment.

WARNING

Deploy freezes are unpopular. Engineers will push back, especially if they have features waiting to ship. Hold the line. A deploy freeze of 48 hours is far less disruptive than a failed upgrade that takes down production for a day because someone deployed an incompatible webhook during the maintenance window.

Maintenance Window

Define a maintenance window with these parameters:

Maintenance Window:
- Start: Saturday 02:00 UTC (lowest traffic period)
- Duration: 4 hours (control plane) + rolling node upgrades through Sunday
- Expected impact: Pods will be rescheduled during node drains (brief disruptions)
- Communication channel: #kubernetes-upgrade (Slack)
- Incident commander: [Name]
- Decision maker for rollback: [Name]

Part 4: Rollback Criteria

Define your rollback criteria before you start the upgrade. If you wait until things are breaking to decide whether to roll back, you will spend too long debating and not enough time acting.

Rollback Triggers

Define specific, measurable conditions that trigger a rollback:

ROLLBACK IF ANY OF THESE ARE TRUE:
1. kube-apiserver is unreachable for more than 5 minutes after upgrade
2. More than 10% of pods enter CrashLoopBackOff within 30 minutes
3. Any critical workload (defined list) is down for more than 10 minutes
4. etcd cluster loses quorum
5. Node NotReady count exceeds 20% of total nodes
6. The upgrade process itself fails midway and cannot be resumed

Rollback Procedure

For kubeadm clusters, rollback means:

Control plane rollback: Restore etcd from pre-upgrade snapshot, reinstall previous version of kubeadm/kubelet/kubectl, run kubeadm init with the old version config
Node rollback: Reinstall previous kubelet version, restart kubelet service

WAR STORY

A team once defined their rollback criteria as "if anything goes wrong." During the production upgrade, a single non-critical monitoring pod went into CrashLoopBackOff because of a minor API incompatibility. They rolled back the entire cluster, losing 6 hours. The fix for the monitoring pod was a one-line manifest change. Define specific, severity-based rollback criteria, not "anything goes wrong."

KEY CONCEPT

The person who decides to rollback should be named in advance. During an incident, "who decides?" wastes precious minutes. Designate a single decision maker (typically the on-call lead or SRE manager) and give them authority to call it.

Part 5: The Three Upgrade Strategies

There are three approaches to upgrading a cluster. Each has trade-offs.

Strategy 1: In-Place Upgrade

Upgrade the existing cluster components in place. This is what kubeadm upgrade does.

Pros: Simpler, no duplicate infrastructure, preserves cluster identity and state Cons: Rollback is harder (restore from backup), brief pod disruptions during node drains Best for: Most clusters, especially smaller ones

# In-place upgrade with kubeadm
kubeadm upgrade apply v1.30.3
# Then upgrade kubelets node by node

Strategy 2: Blue-Green Cluster

Build an entirely new cluster at the target version. Migrate workloads from old (blue) to new (green). Cut over DNS/load balancer when ready.

Pros: Zero-risk rollback (just point traffic back to blue), clean cluster with no upgrade artifacts Cons: Double the infrastructure cost during migration, complex workload migration, stateful workloads are hard to move Best for: Large production clusters where downtime is unacceptable and budget allows

Strategy 3: Rolling Node Replacement

Upgrade the control plane in-place, but replace worker nodes instead of upgrading them. Add new nodes at the target version, drain and remove old nodes.

Pros: Worker nodes are clean (no upgrade artifacts), easier than full blue-green Cons: Requires spare capacity to add nodes before removing old ones, control plane still upgraded in-place Best for: Cloud environments where nodes are disposable (ASGs, node pools)

In-Place vs Blue-Green Upgrade

In-Place Upgrade

Upgrade components on existing cluster

Infrastructure costNo additional cost

Rollback methodetcd restore (complex)

Downtime riskBrief pod disruptions during drains

ComplexityLow to moderate

Stateful workloadsStay in place

Best forSmall to mid-size clusters

Blue-Green Cluster

Build new cluster, migrate workloads

Infrastructure cost2x during migration

Rollback methodDNS/LB switch back (simple)

Downtime riskNear-zero with proper cutover

ComplexityHigh

Stateful workloadsMust migrate data (hard)

Best forLarge clusters, zero-downtime requirements

Part 6: Building the Timeline

Here is a concrete timeline template for a single-hop upgrade across three environments:

=== KUBERNETES UPGRADE: 1.29 → 1.30 ===

WEEK 1: Preparation
  Mon-Tue: Deprecated API audit and remediation
  Wed:     etcd backup procedure tested
  Thu:     Staging cluster backup taken
  Fri:     Upgrade dev cluster (control plane + nodes)

WEEK 2: Staging
  Mon:     Review dev upgrade results, fix any issues
  Tue:     Upgrade staging control plane
  Wed-Thu: Upgrade staging worker nodes (rolling)
  Fri:     Begin staging soak test

WEEK 3: Staging Soak + Prod Prep
  Mon-Wed: Continue staging soak test, run integration tests
  Thu:     Production pre-upgrade checklist complete
  Fri:     Send stakeholder notification, confirm maintenance window

WEEK 4: Production
  Mon:     Deploy freeze begins
  Tue:     Take production etcd backup, verify restore
  Wed:     Maintenance window: upgrade production control plane
  Thu-Fri: Rolling upgrade of production worker nodes
  Sat:     Deploy freeze lifted, post-upgrade monitoring

WEEK 5: Verification
  Mon-Fri: Monitor production for any delayed issues
  Fri:     Upgrade declared complete, retrospective scheduled

PRO TIP

Add buffer time. If the timeline says "4 weeks," plan for 5. Upgrades always take longer than expected: a failed node drain, a stubborn PDB, an operator that needs a manual update. The buffer is not slack time; it is reality.

Part 7: Risk Assessment

Document what can go wrong at each phase and what you will do about it:

Risk	Likelihood	Impact	Mitigation
Deprecated API breaks workload	Medium	High	Pre-upgrade API audit (Module 2, Lesson 1)
etcd backup is corrupt	Low	Critical	Test restore before every upgrade
Admission webhook rejects new API fields	Medium	Medium	Test webhooks in staging
Node drain takes too long (PDB blocks)	High	Low	Set drain timeout, manually review stuck PDBs
Third-party operator incompatible	Medium	High	Check operator release notes for target K8s version
Control plane upgrade fails midway	Low	Critical	etcd restore to pre-upgrade state
kubelet fails to start after upgrade	Low	Medium	Reinstall previous kubelet version on affected node
Network plugin incompatible	Low	Critical	Verify CNI plugin supports target version before starting

Part 8: The Upgrade Runbook Template

Use this as a starting point for your team's runbook:

# Kubernetes Upgrade Runbook: [CURRENT] → [TARGET]

## Pre-Upgrade
1. Confirm current versions: `kubectl version`, node kubelets
2. Complete deprecated API audit
3. Take and verify etcd backup
4. Confirm target version packages available
5. Notify stakeholders
6. Begin deploy freeze

## Control Plane Upgrade
1. Upgrade kubeadm: `apt-get install kubeadm=[TARGET_VERSION]`
2. Verify upgrade plan: `kubeadm upgrade plan`
3. Apply upgrade: `kubeadm upgrade apply v[TARGET_VERSION]`
4. Verify API server health: `kubectl get nodes`
5. Upgrade kubelet on control plane nodes
6. Restart kubelet: `systemctl restart kubelet`
7. Verify control plane pods: `kubectl get pods -n kube-system`

## Worker Node Upgrade (per node)
1. Cordon node: `kubectl cordon [NODE]`
2. Drain node: `kubectl drain [NODE] --ignore-daemonsets --delete-emptydir-data`
3. Upgrade kubeadm: `apt-get install kubeadm=[TARGET_VERSION]`
4. Upgrade node config: `kubeadm upgrade node`
5. Upgrade kubelet and kubectl packages
6. Restart kubelet: `systemctl restart kubelet`
7. Uncordon node: `kubectl uncordon [NODE]`
8. Verify node ready: `kubectl get node [NODE]`
9. Wait for pods to reschedule and become healthy
10. Proceed to next node

## Post-Upgrade Verification
1. All nodes at target version: `kubectl get nodes`
2. All system pods healthy: `kubectl get pods -n kube-system`
3. Application pods healthy across all namespaces
4. Run smoke test suite
5. Check monitoring dashboards for anomalies
6. Lift deploy freeze
7. Send completion notification

## Rollback Procedure
1. If control plane fails: restore etcd, reinstall previous kubeadm/kubelet
2. If worker node fails: reinstall previous kubelet, restart
3. Decision maker for rollback: [NAME]
4. Rollback triggers: [LIST FROM ROLLBACK CRITERIA]

KEY CONCEPT

A runbook is not documentation, it is an executable checklist. Every step should be a concrete command or action, not a vague instruction like "verify things are working." If someone who has never done the upgrade before cannot follow the runbook step by step and succeed, it is not detailed enough.

Key Concepts Summary

Always upgrade in order: dev, staging, production, never skip environments
A deploy freeze before production upgrades prevents debugging confusion from concurrent changes
Define rollback criteria before starting: specific, measurable conditions, not "if anything goes wrong"
Name a single rollback decision maker to avoid wasting time debating during an incident
Three strategies: in-place, blue-green, rolling node replacement, in-place with kubeadm is the most common
Multi-hop upgrades multiply the timeline, each hop needs its own full testing cycle
Build buffer into your timeline: upgrades always take longer than planned
The runbook must be executable: concrete commands, not vague instructions
Communication is part of the plan: stakeholders, deploy freezes, and maintenance windows are not optional

Common Mistakes

Starting with production because "we are behind and need to catch up fast", this is how outages happen
Not defining rollback criteria before starting, leads to paralysis during incidents
Setting a deploy freeze that is too short, 24 hours minimum, 48 hours preferred
Skipping the staging soak test to save time, staging catches the issues that appear after hours, not minutes
Writing a runbook with vague steps like "verify cluster health" instead of specific commands
Not testing the etcd restore procedure, a backup you have not tested is not a backup
Underestimating the time for multi-hop upgrades, 2 hops is not 2x the work, it is closer to 3x
Forgetting to check third-party operator compatibility: CNI plugins, ingress controllers, and cert-manager all have K8s version requirements

KNOWLEDGE CHECK

You need to upgrade from Kubernetes 1.28 to 1.31 across dev, staging, and production environments. What is the minimum number of separate upgrade operations you need to perform?

Version Skew Policy Deep Dive

Continue

Deprecated API Audit

←→ navigateM toggle sidebar