Production Kubernetes Operations

Cluster Lifecycle Thinking

Every Kubernetes cluster has three phases of life: the day you provision it, the day you run your first workload on it, and every day after that. The first two are what the tutorials and blog posts cover. The third is what dominates the cost of ownership.

This lesson is about recognizing which phase you're actually in and investing accordingly. Teams that don't think in these terms build clusters that launch fast and age badly — polished Day 0 configurations coupled with nonexistent Day 2 operations.

KEY CONCEPT

Day 0 is provisioning. Day 1 is deployment. Day 2 is everything else — and Day 2 lasts years. Most Kubernetes cost, most incidents, and most engineering time live in Day 2. If your cluster was designed for a smooth Day 1 but painful Day 2, you'll be rebuilding in 18 months.


The three phases — concretely

Day 0 — before the cluster exists

  • Choose: managed vs self-managed, which cloud, what scale.
  • Design: network topology, IAM model, upgrade strategy.
  • Provision: Terraform/Pulumi/console, spin up the cluster.
  • Install: CNI, CSI, ingress, monitoring, secrets manager.

Duration: days to weeks. Reversibility: mostly reversible; you can rebuild if you don't like it.

Day 1 — first workloads

  • Deploy first app.
  • Wire CI/CD.
  • Set up initial RBAC, namespaces, quotas.
  • Connect to databases, caches, external services.
  • Validate end-to-end: user can hit the service and get a response.

Duration: days to weeks. Reversibility: still high; you're not yet under load or traffic.

Day 2 — operations, forever

  • Debug the 3am pager.
  • Upgrade through five Kubernetes versions.
  • Onboard ten new teams.
  • Survive the cloud provider's regional outage.
  • Absorb a 10× traffic growth.
  • Cut costs 30% under CFO pressure.
  • Recover from the disk failure, the cert expiry, the bad deploy, the resource quota breach.

Duration: years. Until the cluster is retired. Reversibility: lowest. You can't "start over" without disrupting users.


The time distribution that nobody mentions

Where the effort actually goes, as a share of total engineering effort over the cluster lifetime: Day 0: 2-5%. Day 1: 5-10%. Day 2: 85-93%.

That ratio is empirically measurable. I've asked platform teams to track time by phase for 6 months and it comes out in that range every time. Yet most of the content ecosystem (blog posts, tutorials, workshop agendas, certifications) is focused on Day 0 and Day 1 — because that's what's novel and what can be demoed.


Why Day 0 decisions echo through Day 2

The decisions you make in Day 0 propagate throughout Day 2, for better or worse. A few examples:

1. Network topology

Choose pod CIDR too small → can't scale past N nodes → Day 2 emergency.

Example: AWS EKS with the VPC CNI, where pods draw IPs directly from the VPC. With a /16 network (65,536 addresses, or 256 /24 subnets) and ~30 pods per node, you hit IP exhaustion a little past 2,000 nodes. Sounds like a lot until your workload does 5x in a year. Re-IPing a live cluster is painful; reserving a larger range (or planning secondary CIDR blocks) on Day 0 is trivial.
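The arithmetic above is easy to sanity-check before you provision. A minimal sketch, assuming an IP-per-pod CNI like the AWS VPC CNI and ~30 pods per node (both illustrative, not universal defaults):

```python
import ipaddress

def max_nodes(vpc_cidr: str, ips_per_node: int) -> int:
    """Rough node ceiling for an IP-per-pod CNI.

    Each node consumes one IP for itself plus one per pod, so the
    cluster exhausts the range at roughly total_ips / ips_per_node.
    Real clusters lose more IPs to subnet boundaries, load balancers,
    and reserved addresses, so treat this as an upper bound.
    """
    total_ips = ipaddress.ip_network(vpc_cidr).num_addresses
    return total_ips // ips_per_node

print(max_nodes("10.0.0.0/16", 31))  # ~30 pods + 1 node IP -> 2114 nodes
print(max_nodes("10.0.0.0/12", 31))  # 16x the space -> 33825 nodes
```

The second line is the Day 0 fix: the same one-line change that is trivial before provisioning becomes a re-IP project afterward.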

2. Availability zones

Provision in one AZ → when that AZ goes down, the cluster's gone.

Spreading across three AZs on Day 0 is a checkbox. Doing it on Day 2 after you've inherited a single-AZ cluster is a rebuild.

3. Identity model

IAM integrated with cloud workload identity (IRSA, Workload Identity, AAD) on Day 0 → you can scope pod permissions cleanly from the start.

IAM with long-lived static credentials on Day 0 → every new workload inherits the bad pattern and you're stuck with it.

4. Naming conventions and labels

Plan labels (team, cost-center, environment, service) on Day 0 → cost attribution, policy enforcement, and debugging all work naturally.

Figure out labels later → you're retroactively labeling 500 workloads while trying to explain to finance why the bill is unattributable.
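A cheap Day 0 guardrail is rejecting unlabeled workloads before they land. A minimal sketch (the label set and function name are illustrative; in practice you'd enforce this with an admission policy engine such as Kyverno or Gatekeeper):

```python
REQUIRED_LABELS = {"team", "cost-center", "environment", "service"}

def missing_labels(manifest: dict) -> set:
    """Return the required labels absent from a workload manifest."""
    labels = manifest.get("metadata", {}).get("labels", {})
    return REQUIRED_LABELS - labels.keys()

deploy = {
    "kind": "Deployment",
    "metadata": {"labels": {"team": "payments", "environment": "prod"}},
}
print(sorted(missing_labels(deploy)))  # ['cost-center', 'service']
```

Run at admission time, this check means you never accumulate the 500 unlabeled workloads in the first place.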

5. Cluster size

One cluster for everything → cheap at start, multi-tenancy pain later.

Cluster per team → expensive at start, operational overhead compounds.

Neither is wrong; but the choice locks you into certain Day 2 patterns.

WARNING

The single biggest architectural mistake is choosing Day 0 defaults based on "what's easy to demo" rather than "what's easy to operate for 3 years." The demo runs for an hour; the cluster runs for years.


The Day 2 operational areas

Everything this course teaches, organized by when it matters:

Constant ongoing work

  • Monitoring, alerting, on-call.
  • Incident response.
  • Cost tracking.
  • Security hardening (CVE management, patches).

Cadenced work

  • Cluster upgrades (quarterly).
  • Add-on upgrades (monthly-ish).
  • Certificate rotation (annually).
  • Backups (hourly/daily).
  • DR drills (quarterly).

Event-driven work

  • New team onboarding.
  • Workload performance debugging.
  • Node failures.
  • Scale events.
  • Cloud provider incidents.

Most of this is Day 2. Much of it is invisible in Day 0 planning. This course is structured around what matters in Day 2.


The "finished" fallacy

Teams at the end of Day 1 often feel done. The cluster is up, workloads are running, metrics are flowing. Easy to declare victory.

The cluster isn't finished — it's just born. You've crossed into Day 2.

Three signs you're in Day 2 but still thinking in Day 0/1 terms:

  1. You don't have an upgrade plan. Day 0 thinking: "we'll upgrade when we need to." Day 2 thinking: "we upgrade N+1 within 6 months of release, here's the runbook."

  2. You haven't done a DR drill. Day 0: "we have backups." Day 2: "we restored the backup to a scratch cluster last Tuesday; it took 43 minutes."

  3. Your monitoring is just CPU/memory. Day 0: "we have Prometheus." Day 2: "we alert on SLO burn rate; pages fire before customers notice."
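The burn-rate alerting in item 3 is just a ratio: how fast you're consuming error budget relative to what the SLO allows. A minimal sketch (the 14.4x paging threshold follows the common multi-window practice; exact thresholds are a tuning choice, not a standard):

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'sustainable' the error budget burns.

    An SLO of 0.999 allows an error ratio of 0.001; observing 0.015
    over the window means burning budget 15x too fast.
    """
    budget = 1.0 - slo
    return error_ratio / budget

# Page if a 1-hour window burns >14.4x: that pace exhausts a
# 30-day error budget in roughly two days.
rate = burn_rate(error_ratio=0.015, slo=0.999)
print(round(rate, 1), rate > 14.4)  # 15.0 True
```

This is why burn-rate pages fire before customers notice: they react to the rate of damage, not to a raw CPU or memory threshold.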

PRO TIP

A useful exercise: look at what your team spent time on in the last 30 days. If >80% was Day 0/1 (new features, new services) and you've had the cluster for >6 months, you're accumulating Day 2 debt. Rebalance.


Day 2 cost drivers

The work that dominates Day 2 engineering time:

Activity                            % of Day 2 time
Incident response + debugging       25-35%
New team onboarding + RBAC          15-25%
Upgrades (cluster + add-ons)        10-20%
Monitoring + alert tuning           10-15%
Cost optimization                   5-15%
Security response (CVEs, policy)    5-10%
Runbook + documentation work        5-10%

Incident response alone is a quarter to a third of Day 2 time. That's why proper monitoring (which reduces unnecessary pages) and runbooks (which accelerate resolution) pay off so much.


What Day 2 looks like at different maturities

Immature Day 2 (the first year)

  • Every incident is a surprise.
  • Every upgrade is a project.
  • Every new team is bespoke onboarding.
  • Cost is unattributed.
  • On-call is tribal knowledge.
  • Everything is ad hoc.

Mature Day 2 (year 2-3)

  • Incidents follow runbook patterns; most resolve in under 30 min.
  • Upgrades are rolling, scheduled, expected.
  • New teams onboard through templated pipelines.
  • Cost is visible per-team; anomalies trigger alerts.
  • On-call is rotated, documented, practiced.
  • Change is safe because testing and rollback are cheap.

Very mature Day 2 (year 3+, large platforms)

  • Self-service cluster access for teams.
  • Automated capacity planning.
  • Chaos engineering in production.
  • Zero-touch cluster lifecycle (auto-upgrades with validation).
  • Platform engineering team focused on developer productivity, not firefighting.

Most teams never reach the third stage. The second stage is the realistic ceiling for most organizations and is perfectly sufficient.


Planning for Day 2 during Day 0

The practical exercise: before you provision, answer these.

1. Who owns this cluster in year 2?

Not the person who provisioned it. The person who inherits it. Design for them.

2. What's your upgrade cadence?

  • Continuous with every minor release?
  • Quarterly?
  • Annually?

Pick one. Design the cluster to support it. (No, "whenever we have time" is not an answer — that means never.)

3. Who gets paged?

  • Dedicated platform on-call rotation?
  • Shared with application on-call?
  • One hero engineer?

The answer influences documentation investment, runbook quality, and monitoring setup.

4. How do you measure success?

  • Uptime SLO?
  • Deploy frequency per team?
  • Cost per service?
  • Time to resolution?

If you don't measure, you can't improve. Pick your metrics on Day 0 so you start collecting data immediately.

5. What's the DR story?

  • Regional outage survivability?
  • Data loss tolerance (RPO)?
  • Recovery time tolerance (RTO)?

These determine whether you need multi-region, how often you back up, and how much you invest in DR drills.
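The RPO answer translates directly into a backup schedule. A minimal sketch (hypothetical helper; real cadence also depends on retention cost and how long a consistent snapshot takes under load):

```python
def max_backup_interval(rpo_minutes: int, backup_duration_minutes: int) -> int:
    """Longest allowable gap between backup starts.

    Worst-case data loss is roughly the interval between backups plus
    the time the backup itself takes to complete, so the interval must
    leave room for the backup to finish inside the RPO.
    """
    interval = rpo_minutes - backup_duration_minutes
    if interval <= 0:
        raise ValueError("RPO is unachievable with backups this slow")
    return interval

print(max_backup_interval(rpo_minutes=60, backup_duration_minutes=10))  # 50
```

The ValueError branch is the interesting one: it surfaces, on Day 0, the cases where the stated RPO quietly requires replication rather than backups.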


The worst Day 2 scenarios

Specific nightmares I've seen multiple teams hit, each traceable to Day 0/1 skimping:

"We can't upgrade without downtime"

Cause: PodDisruptionBudgets (PDBs) not configured, workloads not tolerant to rolling restarts, no test cluster. Day 2 cost: upgrades are multi-hour outage windows, scheduled rarely, always stressful.
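The PDB gap in this scenario is detectable long before an upgrade. A minimal sketch that flags Deployments whose pods match no PDB selector (manifests shown as plain dicts for brevity; a real check would read them from the API server or your GitOps repo):

```python
def unprotected(deployments: list, pdbs: list) -> list:
    """Names of Deployments whose pod labels match no PDB selector."""
    result = []
    for d in deployments:
        pod_labels = d["spec"]["template"]["metadata"]["labels"]
        # A PDB covers the pods if its matchLabels are a subset of
        # the pod template's labels.
        covered = any(
            pdb["spec"]["selector"]["matchLabels"].items() <= pod_labels.items()
            for pdb in pdbs
        )
        if not covered:
            result.append(d["metadata"]["name"])
    return result

deployments = [
    {"metadata": {"name": "api"},
     "spec": {"template": {"metadata": {"labels": {"app": "api"}}}}},
    {"metadata": {"name": "worker"},
     "spec": {"template": {"metadata": {"labels": {"app": "worker"}}}}},
]
pdbs = [{"spec": {"selector": {"matchLabels": {"app": "api"}}}}]
print(unprotected(deployments, pdbs))  # ['worker']
```

Run in CI, a check like this turns "we can't drain nodes safely" from an upgrade-day surprise into a pull-request comment.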

"The cluster is expensive and we can't tell why"

Cause: no labels on workloads, no cost attribution, no resource request discipline. Day 2 cost: every quarterly budget review is a fire drill.

"One engineer knows how this works"

Cause: tribal knowledge never documented, no runbook culture. Day 2 cost: that engineer is on vacation, something breaks, hours of lost productivity.

"We've been on 1.26 for two years"

Cause: upgrades were deferred because "we're too busy." Day 2 cost: 5-version upgrade is now a multi-quarter project.

"We don't have backups we've tested"

Cause: backups configured, never validated. Day 2 cost: during an actual incident, restoring fails, real data lost.

Each of these has a specific solve in this course. The common thread is: Day 2 emergencies usually trace to Day 0 shortcuts.


Where Day 0 and Day 2 connect

A useful mental exercise. Every time you're about to make a Day 0 choice, ask: "what does this force in Day 2?"

  • "Let me use a default IP range" → what does rebuilding look like if we outgrow it in Day 2?
  • "Skip the cert automation for now" → when will certs expire and how will we know?
  • "One big shared namespace" → what's the migration path when teams need isolation?
  • "Monitor only Prometheus basics" → what incidents will we miss that we wish we'd caught?

Making Day 2 work slightly visible during Day 0 is often the difference between "we built it right" and "we're rebuilding."

KEY CONCEPT

A useful shorthand: "What happens when the person who built this leaves in 18 months?" If the answer is "we're screwed" or "we rebuild," you're making Day 0 choices that won't survive Day 2.


The exit plan

Every cluster has a retirement date, even if you don't know when. A few teams plan for it:

  • Decommission runbook: what steps to tear down the cluster.
  • Workload migration plan: how services move to the next cluster.
  • Data archival: what's kept, where, for how long.

Most don't. But thinking about it forces you to notice whether your cluster is designed as one of many (replaceable, easy to migrate from) or one of one (everything depends on it, can't move off).


How this course maps to Day 2

Every module past this one is Day 2 content:

  • Module 2: Provisioning (Day 0) — but with Day 2 implications explicit.
  • Module 3: Identity — a Day 2 maintenance area.
  • Module 4: Storage — where most Day 2 workload failures live.
  • Module 5: Networking — the #1 Day 2 debugging area.
  • Module 6: Scaling — pure Day 2 operations.
  • Module 7: Monitoring + debugging — the Day 2 feedback loop.
  • Module 8: Upgrades — the Day 2 responsibility that compounds.
  • Module 9: Cost — Day 2 economics.
  • Module 10: DR — the Day 2 safety net.

When each module opens, keep this lesson in mind: the content isn't academic. It's the specific Day 2 skill your cluster needs for the next 5 years.


Quiz

KNOWLEDGE CHECK

Your team has been running a Kubernetes cluster for 18 months. You're spending 70% of your engineering time on new provisioning projects (Day 0 work) and 30% on operations. Your cluster has had three surprise outages in the last quarter. What's the most likely structural issue?


What to take away

  • Three phases: Day 0 (provisioning), Day 1 (deployment), Day 2 (operations). Day 2 lasts years.
  • Day 2 consumes 85-93% of engineering effort across a cluster's life. Plan and budget accordingly.
  • Day 0 decisions echo into Day 2. Network topology, AZ spread, IAM, labels, cluster count — all have multi-year consequences.
  • Incident response, onboarding, upgrades, monitoring, cost, security, documentation — these are Day 2's bread and butter.
  • Mature Day 2 looks boring: predictable, repeatable, documented. If your Day 2 feels exciting, you're under-invested.
  • Plan for the person who inherits the cluster, not yourself on Day 0.
  • Everything in the rest of this course is Day 2 content. Use this framing.

Next module: provisioning done right — the Day 0 choices that don't make your Day 2 miserable.