Managed vs Self-Managed: The Real Trade-offs
Pick the wrong Kubernetes flavor and you'll spend the next two years either paying a premium for something you don't need, or discovering operational surprises your vendor quietly didn't handle. The marketing materials are identical; the reality is not.
This lesson gives you a decision framework that actually maps to your situation: what the big three managed services (EKS, GKE, AKS) really manage, what they leave to you, and when self-managed is genuinely the right call despite what everyone says about "undifferentiated heavy lifting."
Managed Kubernetes is not "Kubernetes without operations" — it's "Kubernetes with fewer operations in specific layers." The control plane is handled. The nodes, networking, storage, upgrades, monitoring, and security are mostly not. Pick managed for the right reasons, not because you assumed it's turnkey.
What "managed" actually means
Each provider draws the line slightly differently, but the shared baseline is roughly this: the managed provider takes ~20% of the operational surface area (the control plane) and leaves the other ~80% (everything user-facing: nodes, networking, storage, workloads) to you. That's still valuable; the 20% is genuinely hard. But it's a smaller win than most teams assume.
What you specifically save with managed
The concrete wins, in priority order:
1. etcd operations go away
Running etcd in production is genuinely hard. Three-member quorum management, backup automation, disk-performance tuning, TLS rotation — all covered in detail in the dedicated etcd course. If you use managed Kubernetes, that entire course is knowledge you don't need to apply daily. That's real value.
2. Control-plane HA is free
On a managed service, the API server runs redundantly across multiple availability zones. You get that without architecting anything; it's the default (or a checkbox) during provisioning.
3. Control-plane upgrades are one command
EKS, GKE, and AKS each let you upgrade the control plane with a single call: `aws eks update-cluster-version`, `gcloud container clusters upgrade`, or `az aks upgrade`. They handle the coordination internally.
Self-managed, you need to coordinate upgrades of the kube-apiserver, controller-manager, scheduler, and every node's kubelet yourself. It's doable (the kubeadm upgrades course covers it), but it's a multi-hour operation.
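To make the coordination burden concrete, here's a toy sketch of the upgrade plan a self-managed operator has to execute by hand: control-plane components first, then each node's kubelet, one minor version at a time (the kubelet may lag the API server but never lead it). The `plan_upgrade` function is illustrative, not a real tool.

```python
# Toy sketch of the self-managed upgrade sequence described above.
# Versions are simplified to minor numbers only (1.28 -> 28).

def plan_upgrade(current_minor: int, target_minor: int, nodes: list[str]) -> list[str]:
    """List the steps to move a cluster from 1.<current> to 1.<target>."""
    steps = []
    # You may only hop one minor version at a time.
    for minor in range(current_minor + 1, target_minor + 1):
        # Control plane first: apiserver, controller-manager, scheduler.
        steps.append(f"upgrade control plane to 1.{minor}")
        # Then every node: drain, upgrade kubelet, uncordon.
        for node in nodes:
            steps.append(f"drain {node}; upgrade kubelet to 1.{minor}; uncordon {node}")
    return steps

for step in plan_upgrade(28, 30, ["node-a", "node-b"]):
    print(step)
```

Two minor versions across two nodes is already six serialized steps; multiply by a realistic node count and the "multi-hour operation" claim is easy to believe.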
4. Baseline security patches on the control plane
If CVE-2024-Whatever lands in kube-apiserver, your provider patches it in hours. Self-managed, you do it yourself — and you need to know it exists.
What you specifically still own
1. Node lifecycle (all of it)
Node provisioning is managed in the sense that EKS-managed node groups, GKE node pools, and AKS scale sets automate instance creation. But:
- OS patching on nodes: still yours. You either use provider-managed AMIs (patch on release) or roll your own.
- Node upgrades: still yours. The provider can orchestrate draining and replacing nodes, but it isn't fully automatic; teams are regularly surprised by PodDisruptionBudgets blocking drains and workloads that don't tolerate eviction.
- Right-sizing: still yours. The provider doesn't pick node sizes for you.
2. Workload-level concerns
Everything a user runs on the cluster is yours: RBAC, resource quotas, network policies, pod security, secrets, ingress, monitoring. The provider touches none of it.
3. Monitoring of the cluster
Most managed services give you basic control-plane metrics through CloudWatch / Cloud Monitoring / Azure Monitor. Workload observability — Prometheus, Grafana, Loki, tracing — you run.
4. Backup and DR
This is the one most teams miss. Managed Kubernetes backs up etcd, not your workloads. If a developer deletes a namespace, the etcd snapshot won't help — the state is already committed. You still need Velero, GitOps, or application-level backup.
The biggest misconception about managed Kubernetes: "we don't need a backup strategy, the provider handles it." The provider backs up the cluster's bookkeeping (etcd) but does NOT provide user-visible time-machine-style rollback of your workloads. If that matters, you run Velero (or GitOps) regardless of managed vs self-managed.
The three managed services, honestly compared
EKS (AWS)
Strengths:
- Widest AWS ecosystem integration (IAM, VPC, ALB, CloudWatch).
- EKS Auto Mode (released 2024) moves much more to managed — closer to GKE-level automation.
- Large community, lots of Terraform modules.
Rough edges:
- IAM integration (IRSA) is powerful but verbose. Lots of YAML + trust policies.
- VPC CNI has historically been "odd" — each pod gets a VPC IP drawn from the node's ENIs, which is fine until you hit per-instance ENI and IP limits.
- Node group vs Fargate vs self-managed nodes is three overlapping options; choosing is confusing.
Cost: control plane $0.10/hr (~$73/mo per cluster) + nodes + data transfer.
GKE (Google Cloud)
Strengths:
- Most mature managed Kubernetes — Google ran this at scale before others.
- GKE Autopilot is the most hands-off option available across the three (you literally don't manage nodes).
- Networking "just works" — Google's backbone, anycast, clean VPC integration.
Rough edges:
- GCP is a smaller ecosystem overall than AWS. Third-party integrations sometimes lag.
- Autopilot's per-second pricing can surprise small teams with many small pods.
- The "GKE vs Autopilot vs Standard" choice maps to "managed vs very managed vs classic" — easy to pick wrong.
Cost: a monthly credit effectively covers the control-plane fee for one zonal (or Autopilot) cluster per billing account; beyond that, $0.10/hr per cluster.
AKS (Azure)
Strengths:
- Free control plane on the Free tier (AKS charges only for nodes and bandwidth; the paid Standard tier adds an uptime SLA).
- Tight integration with Azure AD, Azure Policy, Azure Monitor.
- Good Windows container support for teams that need it.
Rough edges:
- Historically rougher around upgrade coordination than EKS/GKE. Has improved substantially.
- Azure CNI has its own quirks (especially around IP address exhaustion).
- Add-on ecosystem is less mature than AWS/GCP in Kubernetes specifically.
Cost: control plane free, pay for nodes. Often the cheapest of the three at small scale.
When self-managed still makes sense
Despite the obvious appeal of managed, there are cases where self-managed is the right choice:
1. On-prem / hybrid requirements
If you have data center workloads that have to stay in the data center (regulatory, data gravity, performance), you're self-managed by necessity. EKS Anywhere, GKE Enterprise, and Azure Arc offer hybrid options but the operational reality is closer to self-managed than pure managed.
2. Strict data residency and sovereignty
Some regulated industries and jurisdictions require that the entire cluster — control plane included — runs in your own accounts under your own keys. Managed services are improving here (sovereign clouds, bring-your-own-key) but requirements sometimes push you to fully self-managed.
3. Customization beyond what providers allow
If you need custom admission controllers injected at the apiserver level, custom scheduling plugins, or modifications to kubelet that aren't supported — managed providers won't let you. Most teams don't need this.
4. Very large scale
At the high end (thousands of nodes, tens of thousands of pods), managed offerings can hit internal limits or become economically disadvantageous. If you're at that scale, you're also almost certainly at a size where self-managed is operationally feasible.
5. Cost at extreme scale
The control plane fees ($73/mo per EKS cluster) are negligible at small scale and negligible-ish at large scale. But if you're running many small clusters (say, one per team × 50 teams), it adds up to real money. A single self-managed cluster serving many tenants can be cheaper.
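The "many small clusters" math is worth doing explicitly. A minimal sketch, using the $0.10/hr fee quoted above (EKS pricing at the time of writing; adjust for your provider):

```python
# Back-of-envelope control-plane fee math for a fleet of small clusters.

HOURS_PER_MONTH = 730   # average hours in a month
FEE_PER_HOUR = 0.10     # managed control-plane fee, USD (EKS-style)

def control_plane_fees(num_clusters: int) -> float:
    """Monthly control-plane fees for a fleet of managed clusters."""
    return num_clusters * FEE_PER_HOUR * HOURS_PER_MONTH

per_team = control_plane_fees(50)  # one cluster per team, 50 teams
shared = control_plane_fees(1)     # one shared multi-tenant cluster

print(f"50 per-team clusters: ${per_team:,.0f}/mo")
print(f"1 shared cluster:     ${shared:,.0f}/mo")
```

Fifty per-team clusters is ~$3,650/mo in control-plane fees alone, before the duplicated spare capacity and ops overhead — versus $73/mo for one shared cluster.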
When self-managed is the wrong answer
The more common mistake is going self-managed because the team underestimated operational burden:
- "We want to learn Kubernetes from the inside out." Fine for a lab, not a production choice.
- "We want control over everything." Control you don't use is just complexity. Don't pay for it.
- "Managed is too expensive." The control plane is usually under 5% of total cluster cost. Running your own is unlikely to save money once you factor in the SRE time it costs.
- "We don't trust the cloud provider with our control plane." If you don't trust them with the control plane, you shouldn't trust them with the nodes either — the nodes run the actual workloads.
The hidden operational cost of each choice
Here's the comparison most marketing materials skip: most of your operational work is the same either way. Managed saves you on two big items (the control plane and etcd) and nothing else; workloads, monitoring, security, and cost management take the same hours on both.
The decision framework
For 90% of teams, the honest decision tree is short: if you have on-prem, sovereignty, or extreme customization requirements, go self-managed; otherwise use managed, and pick the service that matches your existing cloud footprint. That last rule matters. If your company runs on AWS, you pick EKS. On GCP, GKE. On Azure, AKS. The integration value (IAM, VPC, load balancers, managed databases) dwarfs any Kubernetes-level differences.
Resist the urge to pick the "best" Kubernetes service. They're all fine for 95% of workloads. The wrong answer is running Kubernetes on a cloud you don't otherwise use — integration friction alone wipes out any Kubernetes-level advantage.
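The framework above is simple enough to write down as code. A sketch under the lesson's own rules — `choose_kubernetes` and its parameter names are illustrative, not a real API:

```python
from typing import Optional

# The managed-service choice follows your cloud footprint.
MANAGED = {"aws": "EKS", "gcp": "GKE", "azure": "AKS"}

def choose_kubernetes(cloud: Optional[str],
                      on_prem: bool = False,
                      strict_sovereignty: bool = False,
                      needs_deep_customization: bool = False) -> str:
    """Rough recommendation for 90% of teams, per this lesson."""
    # The exceptions from "When self-managed still makes sense":
    if on_prem or strict_sovereignty or needs_deep_customization:
        return "self-managed"
    # Otherwise: managed, matching your existing cloud.
    return MANAGED.get((cloud or "").lower(),
                       "managed, on whichever cloud you already run")

print(choose_kubernetes("aws"))               # EKS
print(choose_kubernetes(None, on_prem=True))  # self-managed
```

Note what's deliberately missing: no "which service is best" branch, because for 95% of workloads that comparison doesn't change the answer.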
GKE Autopilot and EKS Auto Mode — the "super-managed" tier
Both GKE and EKS now offer a mode where even node management is handled. You define workloads; the provider provisions nodes behind the scenes.
Pros:
- Simplest possible operational experience.
- Automatic right-sizing.
- No node-level concerns at all.
Cons:
- Higher per-resource cost (you pay for the management).
- Less control when you do need to customize (DaemonSets, specific instance types).
- Newer — some corner cases still emerge.
When it's right: teams without dedicated platform engineering, where developer productivity matters more than per-workload cost optimization.
When it's wrong: teams running GPU workloads, specific hardware needs, very large scale, or strong customization requirements.
The total cost comparison — honestly
A concrete example. Mid-sized production cluster: 50 nodes, mostly m5.xlarge.
| Cost category | Managed (EKS) | Self-managed |
|---|---|---|
| Control plane | $73/mo | $0 fee (but you run control-plane nodes) |
| Worker nodes | ~$7,500/mo | ~$7,500/mo |
| Engineer time (control plane) | ~0 hours/mo | ~20 hours/mo |
| Engineer time (nodes/workloads) | ~40 hours/mo | ~40 hours/mo |
| Total $$ directly | ~$7,600/mo | ~$7,500/mo |
| Total $$ incl. $200/hr engineer | ~$15,600/mo | ~$19,500/mo |
Self-managed saves $100/mo in hosting. Managed saves ~$4,000/mo in engineer time. Call the trade roughly even if you ignore the time; include it and managed comes out ~20% cheaper.
Plus: the managed control plane has a 99.95% SLA. If you run the control plane yourself, that's your uptime commitment now.
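You can reproduce the table's bottom line in a few lines. All figures are the illustrative ones from the table above, not quotes:

```python
# Reproducing the cost table: 50 nodes (~$7,500/mo) either way; the
# difference is the control-plane fee vs. extra engineer hours.

ENGINEER_RATE = 200   # USD/hr, as assumed in the table
NODE_COST = 7_500     # ~$7,500/mo worker nodes, both options

def total_cost(control_plane_fee: int, eng_hours: int) -> int:
    """Monthly cost including engineer time at ENGINEER_RATE."""
    return control_plane_fee + NODE_COST + eng_hours * ENGINEER_RATE

managed = total_cost(73, 40)      # $73 fee + 40 eng-hours/mo
self_managed = total_cost(0, 60)  # no fee + 60 eng-hours/mo

print(managed, self_managed)  # 15573 19500
```

The $73 fee is noise next to the ~$4,000/mo of engineer time, which is the whole point of the table.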
Choosing between multiple clusters
One more real-world question: one big shared cluster or many small per-team clusters?
One big cluster:
- Cheaper (control plane, spare capacity, ops overhead shared).
- Multi-tenancy is harder but solvable with namespaces + quotas + policies.
- A single incident can affect everyone.
Many small clusters:
- Fault isolation (a broken cluster is one team's problem).
- Simpler per-team RBAC.
- More money (more control planes, more spare capacity, more ops overhead).
At small scale (< 5 teams), one cluster. At medium scale, a few clusters split by sensitive/non-sensitive or by business unit. At very large scale (hundreds of teams, regulated industries), many clusters become essential.
The instinct to give each team its own cluster is usually a failure to build proper multi-tenancy. Fix the multi-tenancy story (Module 3 + Module 5) before adding clusters. The per-cluster cost adds up fast, and the operational complexity of many clusters is substantial.
Hybrid configurations
A few teams end up in genuine hybrid situations:
- Self-managed control plane, managed nodes: rare. Complex. Usually not worth it.
- Managed primary cluster, self-managed secondary for sensitive data: legitimate regulatory pattern.
- Managed for services, self-managed for GPU training: seen at AI companies where GPU cluster operational needs are unique.
If you find yourself here, the hybrid adds operational complexity. Make sure the benefit (cost, compliance, control) justifies the two-way maintenance.
Practical recommendation
For most teams reading this course:
- Use managed Kubernetes. EKS, GKE, or AKS depending on cloud.
- Start with one cluster per environment (dev, staging, prod).
- Invest in multi-tenancy before adding more clusters.
- Budget for nodes, not the control plane. Control plane costs are rounding errors.
- Plan for Day 2 from Day 0. Rest of this course.
Self-managed is not a failure mode — it's a legitimate choice for specific situations. But it's a choice to walk into with eyes open, not default into because "managed felt expensive."
Quiz
Your team runs on AWS. A consultant recommends self-managed Kubernetes using kubeadm across EC2 instances because it gives you more control. Which of these is the strongest counter-argument?
What to take away
- Managed Kubernetes handles the control plane and etcd. That's it. Everything user-facing is still yours.
- The operational work that dominates Day 2 (workloads, monitoring, security, cost) is the same either way.
- Use managed unless you have specific on-prem, sovereignty, or extreme customization requirements.
- Pick EKS/GKE/AKS based on your cloud footprint, not Kubernetes features.
- "Super-managed" tiers (Autopilot, Auto Mode) save even more ops time — at a premium — and are right for teams without dedicated platform engineering.
- One cluster with strong multi-tenancy beats many small clusters at small/medium scale.
- Cost difference between managed and self-managed is usually dwarfed by engineer-time difference. Budget accordingly.
Next lesson: the Day 0 / Day 1 / Day 2 mental model. Why the work after go-live dominates the work of going live, and how to plan for it.