Your Cluster Is Not Production-Ready
Every team's Kubernetes journey has the same moment. You got the cluster up, deployed your first workload, saw the pods in Running state, and said "we're on Kubernetes." A month later something broke in a way the tutorials never mentioned — a node ran out of disk, a certificate expired, a developer accidentally deleted the production namespace — and you discovered the gap between "it works" and "it's production-ready."
This lesson is that gap, written out. The checklist nobody teaches in the certification courses, the things that separate a cluster that survives contact with real traffic from one that doesn't.
"My pods are Running" is not "my cluster is production-ready." Production-ready means: a human can leave for a week and come back to a working cluster. Under that definition, most clusters you inherit will score under 30%. This lesson is the scoring rubric.
The nine pillars
A production-ready Kubernetes cluster needs all of these. Missing any one is a known unknown waiting to become an outage.
We'll take them one at a time.
Pillar 1 — Identity and access
What's production-ready:
- Every human accesses the cluster through OIDC / SSO, not a static kubeconfig floating in a shared drive.
- Service accounts have scoped cloud identity (IRSA on EKS, Workload Identity on GKE, Azure AD on AKS).
- RBAC roles are named, versioned, and reviewed quarterly.
- A `kubectl delete` is auditable back to a specific person, not "the shared admin token."
What most clusters have:
- A single `admin.conf` passed around in Slack.
- Pods using cloud credentials mounted as raw Secrets.
- ClusterAdmin for every developer because "fixing permissions is hard."
- No idea who deleted the production deployment last Tuesday.
The gap between those two is the difference between "when this developer leaves we reset a credential" and "when this developer leaves we're auditing 18 months of unattributed actions."
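What "named, versioned, reviewed" RBAC looks like in practice can be sketched like this — the namespace, role name, and OIDC group are all illustrative placeholders, not a prescription:

```yaml
# Illustrative namespaced role: read-only access for one team.
# Names are hypothetical; scope and version them for your own org.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: payments-readonly-v1
  namespace: payments
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "pods/log", "deployments", "configmaps"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payments-readonly-v1
  namespace: payments
subjects:
  - kind: Group
    name: payments-developers   # group claim mapped from your OIDC provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: payments-readonly-v1
  apiGroup: rbac.authorization.k8s.io
```

Because the binding targets an OIDC group rather than a shared credential, every action traces back to a named human in the audit log.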
Pillar 2 — Backup and disaster recovery
What's production-ready:
- Automated etcd snapshots every hour, shipped to object storage (S3, GCS, Azure Blob) with cross-region replication.
- Velero running scheduled backups of cluster state (Deployments, ConfigMaps, PVs).
- A written, tested restore procedure — with a number for RTO (how fast can we recover?) and RPO (how much can we lose?).
- A cluster DR drill at least once a quarter.
What most clusters have:
- "We have backups somewhere, I think."
- The backups have never been tested.
- No one on the team knows how long a restore takes.
- The backups are in the same region as the cluster, so a regional outage loses both.
Backup discipline is the difference between "the region is down for 4 hours" and "we lost four months of production state."
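A Velero schedule along these lines is one way to express "hourly, retained, automated" — the bucket and provider live in Velero's BackupStorageLocation (not shown), and the retention window here is an example, not a recommendation:

```yaml
# Hypothetical Velero Schedule: hourly backup of cluster state,
# retained for 30 days (ttl uses Go duration syntax: 720h = 30 days).
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: hourly-cluster-backup
  namespace: velero
spec:
  schedule: "0 * * * *"         # cron: top of every hour
  template:
    includedNamespaces: ["*"]
    snapshotVolumes: true       # also snapshot PVs via the volume snapshotter
    ttl: 720h0m0s
```

The schedule alone is not the pillar — the tested restore procedure and the cross-region copy of the bucket are what make it count.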
Pillar 3 — Monitoring and alerting
What's production-ready:
- A metrics stack that covers the cluster layer (Prometheus scraping kubelet, cadvisor, kube-state-metrics) AND the workload layer (app metrics).
- Logs centralized off-cluster (Loki, ELK, or cloud logging) so a node crash doesn't lose logs.
- SLO-based alerts that fire on user-visible pain (error rate, tail latency), not just "CPU is 80%."
- Dashboards an on-call engineer can read at 3am without three months of context.
What most clusters have:
- Prometheus running somewhere, dashboards no one looks at.
- Alerts that fire constantly but only on infrastructure metrics nobody cares about.
- Critical app signals invisible — no tracing, no meaningful logs.
If nothing pages during an incident while users are complaining, your monitoring has failed you. Covered in detail in Module 7.
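An SLO-based alert of the kind described above can be sketched as a Prometheus rule — the metric name, job label, and 1% threshold are assumptions to adapt to your own services:

```yaml
# Sketch of an SLO-style alert: page on user-visible error rate,
# not on raw CPU. Metric/job names and thresholds are placeholders.
groups:
  - name: slo-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{job="api", code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="api"}[5m])) > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "API 5xx error rate above 1% for 5 minutes"
```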
Pillar 4 — Upgrades and lifecycle
What's production-ready:
- A runbook for upgrading the cluster from version N to N+1 — tested in staging, reproducible.
- An add-on compatibility matrix tracking CNI, CSI, ingress, monitoring components.
- Certificate rotation automated or tracked with alerts 60 days before expiry.
- Node OS patching on a cadence (monthly minimum).
What most clusters have:
- "We're still on Kubernetes 1.26 because upgrades are scary."
- Certs that expire randomly causing outages no one connects to the cert.
- Add-ons that haven't been updated since provisioning because no one knows if it's safe.
Upgrade cowardice compounds. A 5-version-behind cluster is exponentially harder to upgrade than a 1-version-behind one. Module 8 covers this in full.
The single biggest sin in Kubernetes operations is letting a cluster drift many versions behind current. You can't skip. The upgrade path from 1.26 → 1.31 is five separate upgrades, each with its own breaking changes. Falling behind is not neutral — it's an accumulating technical debt with a cliff ending.
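The 60-day certificate-expiry warning mentioned above can be automated with a Prometheus rule, assuming blackbox_exporter is probing your TLS endpoints (if it isn't, the same idea applies with whatever cert metric your stack exposes):

```yaml
# One way to track cert expiry. Assumes blackbox_exporter probes
# exposing probe_ssl_earliest_cert_expiry (a unix-timestamp gauge).
groups:
  - name: cert-expiry
    rules:
      - alert: CertificateExpiringSoon
        expr: (probe_ssl_earliest_cert_expiry - time()) < 60 * 24 * 3600
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "TLS certificate expires in under 60 days"
```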
Pillar 5 — Multi-tenancy and isolation
What's production-ready:
- Namespaces with resource quotas (CPU, memory, PVC count, pod count).
- Network policies that default-deny and allow-list actual traffic.
- Per-namespace service accounts with scoped permissions.
- Admission controllers (OPA Gatekeeper, Kyverno) enforcing policy — no latest tags, no root containers, required labels.
What most clusters have:
- Namespaces as labels only. No quotas, no network policies.
- Any pod can talk to any other pod.
- One team's runaway workload can take down the cluster.
Multi-tenancy without isolation isn't multi-tenancy — it's shared fate. Covered in Modules 3 and 5.
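The quota and default-deny baseline described above can be sketched per namespace — the numbers are placeholders to be sized from observed usage, not recommendations:

```yaml
# Illustrative per-namespace guardrails: a resource quota plus a
# default-deny ingress policy. All names and numbers are placeholders.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 64Gi
    persistentvolumeclaims: "10"
    pods: "100"
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: team-a
spec:
  podSelector: {}           # selects every pod in the namespace
  policyTypes: ["Ingress"]  # deny all inbound until explicitly allow-listed
```

With default-deny in place, each real traffic path gets its own explicit allow policy, which doubles as documentation of who talks to whom.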
Pillar 6 — Cost tracking
What's production-ready:
- The monthly cloud bill is broken out by namespace, team, and/or product.
- Unusual spend triggers an alert (20%+ month-over-month per namespace).
- Resource requests are sized from real observed usage, not guesses.
- Unused workloads are identified and decommissioned (not running pods paying for idle GPU for nine months).
What most clusters have:
- A CFO asking "why is the Kubernetes bill growing?" and no answer.
- Pods with `resources: {}` (no requests, no limits) running untuned forever.
- Dev workloads running 24/7 because "we might need them later."
Cost without attribution means no feedback loop — teams keep over-provisioning because the cost lands on the platform team, not them. Module 9.
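One low-risk way to size requests from real usage is a VerticalPodAutoscaler in recommendation-only mode — this assumes the VPA add-on is installed, and the target Deployment name is hypothetical:

```yaml
# VPA in "Off" mode only publishes recommendations; it never evicts
# or resizes pods. Read the output via `kubectl describe vpa`.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout-vpa
  namespace: team-a
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout          # hypothetical workload name
  updatePolicy:
    updateMode: "Off"       # recommend only; humans apply the numbers
```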
Pillar 7 — Scaling under load
What's production-ready:
- Horizontal pod autoscalers on user-facing workloads, tuned to real signals (request rate, queue depth).
- Cluster autoscaler (or Karpenter) provisioning nodes as needed.
- Pod Disruption Budgets so that autoscaling events don't take down too many replicas at once.
- Load tests that reflect production traffic shape.
What most clusters have:
- Fixed replica counts tuned to last year's peak.
- No cluster autoscaling — you manually scale when pods don't fit.
- PDBs never configured, so rollouts take down the whole fleet.
If you can't handle a 3x traffic spike without human intervention, your scaling isn't production-ready. Module 6.
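An HPA and a PDB on the same workload, as described above, might be sketched like this — replica counts, the CPU target, and names are illustrative:

```yaml
# Sketch: autoscaling plus a disruption budget, so scale events and
# node drains can't drop the workload below two ready replicas.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api
```

CPU utilization is the simplest signal; as the text notes, request rate or queue depth is usually a better one for user-facing services.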
Pillar 8 — Security hardening
What's production-ready:
- Pod Security Standards at `baseline` or `restricted` (no root, no privileged, no hostPath).
- Image signing / admission (cosign, Notary) ensuring only trusted images run.
- Secrets from a real secrets manager (Vault, AWS Secrets Manager), not inline in YAML.
- CVE scanning in CI and on running workloads.
- Kubernetes audit logs shipped to a SIEM.
What most clusters have:
- Everything runs as root because it "worked on my laptop."
- Secrets stored in Git (base64 is not encryption).
- Images pulled from the `:latest` tag of a random public repo.
Security is a compounding failure mode — a cluster that's insecure today is easier to compromise tomorrow as more code lands on it.
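Enforcing the Pod Security Standards mentioned above needs no add-on — the built-in admission controller reads namespace labels (available in stock Kubernetes since 1.25; the namespace name is illustrative):

```yaml
# Enforce the restricted Pod Security Standard at the namespace level.
# warn/audit at the same level surface violations without blocking twice.
apiVersion: v1
kind: Namespace
metadata:
  name: team-a
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```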
Pillar 9 — Incident response
What's production-ready:
- Written runbooks for common failures: node NotReady, PVC stuck in Pending, ingress 5xx spikes, etcd quorum loss.
- An on-call rotation with escalation.
- Postmortem culture — incidents get documented, action items ship.
- Game days / chaos engineering practice at least quarterly.
What most clusters have:
- The one engineer who "knows Kubernetes" gets paged for everything.
- Runbooks that haven't been touched since the cluster was provisioned.
- No postmortems, so the same incident keeps happening.
Incident response isn't a tool — it's a practice. Module 10 covers the drill cadence.
The scoring exercise
Run through your current cluster. Score each pillar 0 to 3:
- 0 — not even a stub. We haven't thought about this.
- 1 — something exists but it's not tested, not reliable, or not consistent.
- 2 — solid on the happy path. Handles normal cases.
- 3 — production-grade. Tested, monitored, repeatable, survives the person who built it leaving.
Total out of 27. Most teams score 8-12 on a cluster they'd honestly describe as "production." That's not judgment — it's a map of the next quarter of work.
Do this scoring with your team, openly. The goal is not to score well — the goal is to surface the gaps. A cluster that scores 10/27 with everyone seeing the same gaps is safer than one that scores 20/27 in the lead engineer's head and 5/27 from everyone else's view.
Why certifications don't teach this
The CKA and CKAD exams are excellent at teaching "how to use Kubernetes." They're essentially zero help on "how to operate a production cluster." The certifications cover:
- How to create a Deployment (Day 0).
- How to debug a failing pod (Day 1).
- Almost nothing about Day 2: upgrades, backups, cost, multi-tenancy, incident response.
This is not a knock on the certifications — they're designed as entry-level credentials. But an engineer who has passed CKA and calls a cluster they built "production-ready" is almost always wrong. The certification didn't cover what they missed.
This course is the operations knowledge the certifications skip.
The order things fail
If you had to predict the order in which your cluster will have its first production incident, here's a rough empirical ordering from teams I've worked with:
1. Certificate expires (weeks-to-months after provisioning). Silent, catastrophic, easy to miss.
2. Disk pressure on a node kills pods (first few months). No one set up monitoring, so nobody noticed until an outage.
3. One team's runaway workload consumes the cluster (3-6 months in). No resource quotas.
4. An upgrade that can't be done (6-12 months in). Too far behind, too many add-ons to coordinate.
5. Data loss from a stateful workload (varies — depends on luck). Backups weren't tested.
Each of these has a specific lesson in this course. If you've already experienced one, you know how painful they are; if you haven't, they're coming.
The compounding value of getting it right
Every pillar you solve compounds. Proper RBAC + audit logs means incidents get root-caused faster. Monitoring + alerting means issues are caught earlier. Upgrade discipline means you don't accumulate version debt that makes all future changes harder.
A team that invests in production-readiness early moves faster in year two, not slower. A team that defers it ends up rebuilding the cluster in year three because the operational debt isn't survivable.
Production readiness is not a gate you pass through — it's a continuous practice. Clusters decay. Add-ons bitrot. Certs expire. Teams turn over. What you build this quarter needs to be operable next quarter by someone who wasn't there.
What this course commits to teaching
Across the next nine modules, we'll cover each pillar in depth with:
- Concrete configuration patterns that work across EKS, GKE, and AKS.
- The specific failure modes for each pillar and how to detect them early.
- Real production numbers — resource allocations, backup frequencies, alert thresholds — not theoretical ranges.
- Runbooks you can clone into your own ops wiki.
You'll end this course with a cluster scorecard that's moved from single-digit to 20+ without cargo-culting "best practices" that don't match your team's actual scale.
Quiz
A new engineer joins your team. They look at the cluster, see that all pods are Running and the dashboard is green, and declare it production-ready. What is the single best counter-question to ask them?
What to take away
- Production-ready means a week without you doesn't break things. Measure yourself against that bar.
- Nine pillars: identity, backup/DR, monitoring, upgrades, multi-tenancy, cost, scaling, security, incident response.
- Most "production" clusters score 8-12 out of 27. That's not shameful — it's a roadmap.
- The order of first failures is predictable: cert expiry, disk pressure, runaway workload, stuck upgrade, data loss.
- Certifications teach Day 0 and Day 1. The rest of this course is Day 2 — the part that actually matters for operations.
- Readiness compounds. Investing early pays back as a moat; deferring it becomes unsurvivable technical debt.
Next lesson: the managed-vs-self-managed trade-off. What EKS/GKE/AKS actually do for you, what they don't, and the operational cost hidden in each choice.