Building Your Upgrade Plan
You manage 3 clusters: dev (5 nodes), staging (20 nodes), and production (200 nodes across 3 availability zones). The CTO sends an email: "We need to be on Kubernetes 1.31 by end of quarter for compliance. Make it happen."
You have 12 weeks. You are currently on 1.29, which means two minor version hops (1.29 to 1.30, then 1.30 to 1.31). Each hop needs testing. Each cluster needs its own upgrade window. Production has 200 nodes that need to be drained and upgraded without downtime.
You cannot wing this. You need a plan: a real one, with timelines, rollback criteria, and communication checkpoints. This lesson teaches you how to build that plan.
Part 1: The Upgrade Sequence: Environment Order
Never upgrade production first. This is obvious but worth stating explicitly because under deadline pressure, people do obviously wrong things.
The Pipeline: Dev, Staging, Production
Every upgrade flows through environments in order:
| Phase | Environment | Purpose | Duration |
|---|---|---|---|
| 1 | Dev | Catch obvious breakage, test the upgrade process itself | 1-2 days |
| 2 | Staging | Validate workloads, test integrations, soak test | 3-5 days |
| 3 | Production | The real thing, with maintenance window and rollback plan | 1-3 days |
Dev is your throwaway environment. If the upgrade fails here, nothing is lost. Use dev to practice the upgrade commands, identify any kubeadm issues, and discover missing prerequisites.
Staging is your confidence builder. It should mirror production as closely as possible (same Kubernetes version, same add-ons, same network policies, same admission webhooks). Staging catches the issues that dev misses: workload incompatibilities, webhook failures, CRD issues.
Production is where the stakes are real. By the time you reach production, you should have already done the upgrade twice (dev + staging) and know exactly what to expect.
The upgrade pipeline is not just about finding bugs, it is about building muscle memory. By the time you upgrade production, you should be able to do it in your sleep because you have already done it twice. Every surprise should have been encountered and resolved in dev or staging.
Multi-Hop Upgrades
If you need to jump multiple minor versions (e.g., 1.29 to 1.31), you run the entire pipeline for each hop:
Hop 1 (1.29 → 1.30):
Dev: Days 1-2
Staging: Days 3-7
Prod: Days 8-10
Hop 2 (1.30 → 1.31):
Dev: Days 11-12
Staging: Days 13-17
Prod: Days 18-20
Total: approximately 4 weeks for a 2-hop upgrade done properly. This is why the CTO giving you 12 weeks is actually reasonable, but only if you start now.
Part 2: The Pre-Upgrade Checklist
Before touching any cluster, complete this checklist. Print it. Check off each item.
Infrastructure Readiness
- [ ] Current cluster version confirmed across all components
(API server, controller-manager, scheduler, kubelets, etcd)
- [ ] Target version identified (latest patch of the target minor version)
- [ ] kubeadm, kubelet, and kubectl packages available for target version
- [ ] etcd compatibility confirmed for target Kubernetes version
- [ ] Node OS packages up to date (container runtime, iptables/nftables)
- [ ] Sufficient disk space on control plane nodes for etcd backup
- [ ] Container runtime (containerd/CRI-O) compatible with target version
Workload Readiness
- [ ] Deprecated API audit complete (pluto/kubent scan — see Module 2, Lesson 1)
- [ ] All deprecated APIs migrated to supported versions
- [ ] Pod Disruption Budgets (PDBs) configured for critical workloads
- [ ] Third-party operators checked for target version compatibility
- [ ] Admission webhooks tested against target API server version
- [ ] CRDs verified compatible with target version
- [ ] Helm charts checked for deprecated template functions
Backup Readiness
- [ ] etcd snapshot taken and verified (see Module 2, Lesson 2)
- [ ] etcd snapshot stored in external storage (not on the control plane node)
- [ ] Cluster resource manifests exported (kubectl get all --all-namespaces -o yaml)
- [ ] Persistent volume snapshots taken for critical data
- [ ] Restore procedure tested in an isolated environment
Store this checklist in your team wiki or runbook repo and update it after every upgrade. Each upgrade teaches you something new, add it to the checklist so the next upgrade benefits from the lesson.
Part 3: Communication and Coordination
An upgrade is not just a technical operation. It involves people.
Stakeholder Notification
Send advance notice to these groups:
| Audience | When | What to Tell Them |
|---|---|---|
| Engineering teams | 2 weeks before prod upgrade | "Kubernetes upgrade on [date]. Deploy freeze starts [date]. Expected duration: [X hours]." |
| SRE/On-call team | 1 week before | Full runbook, rollback criteria, escalation contacts |
| Management | 1 week before | Timeline, risk assessment, business impact if things go wrong |
| Security team | After completion | "Upgrade complete. Now on [version]. Compliance requirement met." |
Deploy Freeze
Implement a deploy freeze starting 24-48 hours before the production upgrade:
- No new deployments to production
- No Helm releases
- No CRD updates
- No admission webhook changes
The deploy freeze exists because you need a stable baseline. If someone deploys a broken app during the upgrade, you will waste hours figuring out whether the breakage is from the upgrade or the deployment.
Deploy freezes are unpopular. Engineers will push back, especially if they have features waiting to ship. Hold the line. A deploy freeze of 48 hours is far less disruptive than a failed upgrade that takes down production for a day because someone deployed an incompatible webhook during the maintenance window.
Maintenance Window
Define a maintenance window with these parameters:
Maintenance Window:
- Start: Saturday 02:00 UTC (lowest traffic period)
- Duration: 4 hours (control plane) + rolling node upgrades through Sunday
- Expected impact: Pods will be rescheduled during node drains (brief disruptions)
- Communication channel: #kubernetes-upgrade (Slack)
- Incident commander: [Name]
- Decision maker for rollback: [Name]
Part 4: Rollback Criteria
Define your rollback criteria before you start the upgrade. If you wait until things are breaking to decide whether to roll back, you will spend too long debating and not enough time acting.
Rollback Triggers
Define specific, measurable conditions that trigger a rollback:
ROLLBACK IF ANY OF THESE ARE TRUE:
1. kube-apiserver is unreachable for more than 5 minutes after upgrade
2. More than 10% of pods enter CrashLoopBackOff within 30 minutes
3. Any critical workload (defined list) is down for more than 10 minutes
4. etcd cluster loses quorum
5. Node NotReady count exceeds 20% of total nodes
6. The upgrade process itself fails midway and cannot be resumed
Rollback Procedure
For kubeadm clusters, rollback means:
- Control plane rollback: Restore etcd from pre-upgrade snapshot, reinstall previous version of kubeadm/kubelet/kubectl, run
kubeadm initwith the old version config - Node rollback: Reinstall previous kubelet version, restart kubelet service
A team once defined their rollback criteria as "if anything goes wrong." During the production upgrade, a single non-critical monitoring pod went into CrashLoopBackOff because of a minor API incompatibility. They rolled back the entire cluster, losing 6 hours. The fix for the monitoring pod was a one-line manifest change. Define specific, severity-based rollback criteria, not "anything goes wrong."
The person who decides to rollback should be named in advance. During an incident, "who decides?" wastes precious minutes. Designate a single decision maker (typically the on-call lead or SRE manager) and give them authority to call it.
Part 5: The Three Upgrade Strategies
There are three approaches to upgrading a cluster. Each has trade-offs.
Strategy 1: In-Place Upgrade
Upgrade the existing cluster components in place. This is what kubeadm upgrade does.
Pros: Simpler, no duplicate infrastructure, preserves cluster identity and state Cons: Rollback is harder (restore from backup), brief pod disruptions during node drains Best for: Most clusters, especially smaller ones
# In-place upgrade with kubeadm
kubeadm upgrade apply v1.30.3
# Then upgrade kubelets node by node
Strategy 2: Blue-Green Cluster
Build an entirely new cluster at the target version. Migrate workloads from old (blue) to new (green). Cut over DNS/load balancer when ready.
Pros: Zero-risk rollback (just point traffic back to blue), clean cluster with no upgrade artifacts Cons: Double the infrastructure cost during migration, complex workload migration, stateful workloads are hard to move Best for: Large production clusters where downtime is unacceptable and budget allows
Strategy 3: Rolling Node Replacement
Upgrade the control plane in-place, but replace worker nodes instead of upgrading them. Add new nodes at the target version, drain and remove old nodes.
Pros: Worker nodes are clean (no upgrade artifacts), easier than full blue-green Cons: Requires spare capacity to add nodes before removing old ones, control plane still upgraded in-place Best for: Cloud environments where nodes are disposable (ASGs, node pools)
In-Place vs Blue-Green Upgrade
In-Place Upgrade
Upgrade components on existing cluster
Blue-Green Cluster
Build new cluster, migrate workloads
Part 6: Building the Timeline
Here is a concrete timeline template for a single-hop upgrade across three environments:
=== KUBERNETES UPGRADE: 1.29 → 1.30 ===
WEEK 1: Preparation
Mon-Tue: Deprecated API audit and remediation
Wed: etcd backup procedure tested
Thu: Staging cluster backup taken
Fri: Upgrade dev cluster (control plane + nodes)
WEEK 2: Staging
Mon: Review dev upgrade results, fix any issues
Tue: Upgrade staging control plane
Wed-Thu: Upgrade staging worker nodes (rolling)
Fri: Begin staging soak test
WEEK 3: Staging Soak + Prod Prep
Mon-Wed: Continue staging soak test, run integration tests
Thu: Production pre-upgrade checklist complete
Fri: Send stakeholder notification, confirm maintenance window
WEEK 4: Production
Mon: Deploy freeze begins
Tue: Take production etcd backup, verify restore
Wed: Maintenance window: upgrade production control plane
Thu-Fri: Rolling upgrade of production worker nodes
Sat: Deploy freeze lifted, post-upgrade monitoring
WEEK 5: Verification
Mon-Fri: Monitor production for any delayed issues
Fri: Upgrade declared complete, retrospective scheduled
Add buffer time. If the timeline says "4 weeks," plan for 5. Upgrades always take longer than expected: a failed node drain, a stubborn PDB, an operator that needs a manual update. The buffer is not slack time; it is reality.
Part 7: Risk Assessment
Document what can go wrong at each phase and what you will do about it:
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Deprecated API breaks workload | Medium | High | Pre-upgrade API audit (Module 2, Lesson 1) |
| etcd backup is corrupt | Low | Critical | Test restore before every upgrade |
| Admission webhook rejects new API fields | Medium | Medium | Test webhooks in staging |
| Node drain takes too long (PDB blocks) | High | Low | Set drain timeout, manually review stuck PDBs |
| Third-party operator incompatible | Medium | High | Check operator release notes for target K8s version |
| Control plane upgrade fails midway | Low | Critical | etcd restore to pre-upgrade state |
| kubelet fails to start after upgrade | Low | Medium | Reinstall previous kubelet version on affected node |
| Network plugin incompatible | Low | Critical | Verify CNI plugin supports target version before starting |
Part 8: The Upgrade Runbook Template
Use this as a starting point for your team's runbook:
# Kubernetes Upgrade Runbook: [CURRENT] → [TARGET]
## Pre-Upgrade
1. Confirm current versions: `kubectl version`, node kubelets
2. Complete deprecated API audit
3. Take and verify etcd backup
4. Confirm target version packages available
5. Notify stakeholders
6. Begin deploy freeze
## Control Plane Upgrade
1. Upgrade kubeadm: `apt-get install kubeadm=[TARGET_VERSION]`
2. Verify upgrade plan: `kubeadm upgrade plan`
3. Apply upgrade: `kubeadm upgrade apply v[TARGET_VERSION]`
4. Verify API server health: `kubectl get nodes`
5. Upgrade kubelet on control plane nodes
6. Restart kubelet: `systemctl restart kubelet`
7. Verify control plane pods: `kubectl get pods -n kube-system`
## Worker Node Upgrade (per node)
1. Cordon node: `kubectl cordon [NODE]`
2. Drain node: `kubectl drain [NODE] --ignore-daemonsets --delete-emptydir-data`
3. Upgrade kubeadm: `apt-get install kubeadm=[TARGET_VERSION]`
4. Upgrade node config: `kubeadm upgrade node`
5. Upgrade kubelet and kubectl packages
6. Restart kubelet: `systemctl restart kubelet`
7. Uncordon node: `kubectl uncordon [NODE]`
8. Verify node ready: `kubectl get node [NODE]`
9. Wait for pods to reschedule and become healthy
10. Proceed to next node
## Post-Upgrade Verification
1. All nodes at target version: `kubectl get nodes`
2. All system pods healthy: `kubectl get pods -n kube-system`
3. Application pods healthy across all namespaces
4. Run smoke test suite
5. Check monitoring dashboards for anomalies
6. Lift deploy freeze
7. Send completion notification
## Rollback Procedure
1. If control plane fails: restore etcd, reinstall previous kubeadm/kubelet
2. If worker node fails: reinstall previous kubelet, restart
3. Decision maker for rollback: [NAME]
4. Rollback triggers: [LIST FROM ROLLBACK CRITERIA]
A runbook is not documentation, it is an executable checklist. Every step should be a concrete command or action, not a vague instruction like "verify things are working." If someone who has never done the upgrade before cannot follow the runbook step by step and succeed, it is not detailed enough.
Key Concepts Summary
- Always upgrade in order: dev, staging, production, never skip environments
- A deploy freeze before production upgrades prevents debugging confusion from concurrent changes
- Define rollback criteria before starting: specific, measurable conditions, not "if anything goes wrong"
- Name a single rollback decision maker to avoid wasting time debating during an incident
- Three strategies: in-place, blue-green, rolling node replacement, in-place with kubeadm is the most common
- Multi-hop upgrades multiply the timeline, each hop needs its own full testing cycle
- Build buffer into your timeline: upgrades always take longer than planned
- The runbook must be executable: concrete commands, not vague instructions
- Communication is part of the plan: stakeholders, deploy freezes, and maintenance windows are not optional
Common Mistakes
- Starting with production because "we are behind and need to catch up fast", this is how outages happen
- Not defining rollback criteria before starting, leads to paralysis during incidents
- Setting a deploy freeze that is too short, 24 hours minimum, 48 hours preferred
- Skipping the staging soak test to save time, staging catches the issues that appear after hours, not minutes
- Writing a runbook with vague steps like "verify cluster health" instead of specific commands
- Not testing the etcd restore procedure, a backup you have not tested is not a backup
- Underestimating the time for multi-hop upgrades, 2 hops is not 2x the work, it is closer to 3x
- Forgetting to check third-party operator compatibility: CNI plugins, ingress controllers, and cert-manager all have K8s version requirements
You need to upgrade from Kubernetes 1.28 to 1.31 across dev, staging, and production environments. What is the minimum number of separate upgrade operations you need to perform?