The 5-Step Reasoning Framework
You have prepared for weeks. You can recite the life of a pod. You know the difference between a Deployment and a StatefulSet. You have read the etcd design doc twice.
The interviewer says: "Design Kubernetes for a fintech company."
You freeze.
Where do you start? Authentication? The control plane? Payments compliance? Node pools? Networking? The blank whiteboard is somehow harder than any YAML you have ever written, and the silence is stretching into an uncomfortable second, then a third, then a fifth.
This lesson exists so that never happens to you again. The 5-step framework is a set of verbal rails you can apply to any K8s system design question — fintech, gaming, e-commerce, IoT, AI — and produce a structured, scorable answer every time.
Why a Framework Beats Improvisation
Every strong senior engineer I have interviewed who passed the loop had some version of a framework. The names differed, the step count differed, but the pattern was the same: a sequence of verbal moves that turned vague prompts into structured responses.
The framework does three things for you:
- It removes the cold-start problem. You never stare at a blank whiteboard, because you always start at Step 1.
- It signals structure to the interviewer. When you say "let me cover the non-functional requirements next," the interviewer immediately recognizes the move and gives you credit.
- It prevents you from skipping load-bearing sections. Without a framework, almost everyone forgets either capacity math or trade-offs under pressure. The framework makes those steps non-skippable.
The framework is not a script. It is a set of rails. You will go off the rails when the interviewer probes — that is expected and good. The rails are there so you can always get back on after the detour.
The 5 Steps at a Glance
- Step 1: Clarify (2-5 minutes): pin down functional requirements, non-functional requirements, constraints, and unknowns.
- Step 2: Capacity (5-7 minutes): turn the requirements into pod counts, node counts, and cost.
- Step 3: Architecture (15-20 minutes): cover control plane, data plane, networking, storage, and observability.
- Step 4: Trade-offs (10 minutes): name the alternative you did not pick at every major decision.
- Step 5: Follow-ups (5 minutes): volunteer failure modes, edge cases, and future evolution.
Step 1: Clarify (2-5 Minutes)
Before you design, you need four categories of information: functional requirements, non-functional requirements, constraints, and unknowns. Go in this order.
Functional Requirements
What does the system need to do? What workloads run on it? Stateless, stateful, batch, streaming, ML? Internal-facing or customer-facing? Read-heavy or write-heavy? The functional layer tells you what kind of primitives you will need — Deployments, StatefulSets, Jobs, CronJobs, Operators — before you pick any.
Non-Functional Requirements
How well must the system do it? This is the quantitative layer: requests per second, p50 and p99 latency, availability SLA, durability, throughput. Ask for numbers. If the interviewer is vague, offer a range: "would you say something like 10,000 RPS peak, or more like 100,000?"
Constraints
What is outside your control? Cloud or on-prem, budget, team size and seniority, compliance requirements (PCI, HIPAA, SOC2, FedRAMP), existing tooling, timeline. Constraints are usually the single biggest driver of the design. A two-engineer team building on a $5k/month budget gets a radically different design than a 50-engineer platform org with unlimited budget.
Unknowns
What do you not know and are going to assume? Surface assumptions explicitly so the interviewer can correct you. "I am going to assume this is a single region for now unless you tell me otherwise." Unstated assumptions are silent landmines.
Write the four categories as column headers on the whiteboard before you ask the first question. You fill in cells as the interviewer answers. This visual structure alone often earns you "exceeds" on the clarifying-questions row of the rubric.
Sample Clarifying Questions
# Functional
- What workloads are we running? Stateless web, stateful DBs, batch, ML?
- Any specialized hardware? GPUs, high-memory, IOPS-heavy storage?
- Internal-only, customer-facing, or both?
# Non-functional
- Target RPS at peak? Current and 12-month projected?
- p50 and p99 latency targets?
- Availability SLA — 99.9, 99.95, 99.99?
- RPO and RTO for stateful workloads?
# Constraints
- Which cloud, or on-prem, or hybrid?
- Budget or cost target?
- Team size and Kubernetes experience?
- Compliance requirements?
- Existing tooling we must integrate with?
# Unknowns
- Are there any requirements I should ask about that I have not?
Do not ask every question on this list. Pick 4 to 6 that most shape the design and go deep on those. Asking all of them sounds like you are running down a checklist rather than reasoning about what matters for this problem.
Step 2: Capacity (5-7 Minutes)
This is the step almost everyone skips under pressure, and the single biggest differentiator for staff-level scoring. You are going to turn the non-functional requirements into a set of numbers.
The Basic Chain
The capacity chain for stateless web services looks like this:
RPS (requests per second)
× p99 latency in seconds
= concurrent requests in flight

concurrent requests
÷ per-pod concurrency
= pod count

pod count
× per-pod resources (CPU, memory)
÷ usable node capacity
= base node count

base node count
× 1.5 (burst), + 1 (N+1 spare for AZ failure), × 1.15 (maintenance headroom)
= final node count
Worked Example
The interviewer says: 20,000 RPS, p99 latency 300ms, 99.9 percent availability, AWS.
Step through the math out loud:
Concurrent requests in flight = 20,000 × 0.3 seconds = 6,000.
Assuming a pod handles 100 concurrent requests at the latency target, pod count = 60.
Each pod requests 500m CPU and 1Gi memory. That is 30 CPU and 60Gi memory total.
On m6i.2xlarge with 8 vCPU and 32Gi memory, after 10 percent system overhead, usable is roughly 7 vCPU and 28Gi per node. That fits about 14 pods per node by CPU.
60 pods ÷ 14 pods per node = 5 nodes.
Apply safety factors: 1.5x burst = 8 nodes. N+1 for AZ failure across 3 AZs = 9 nodes (3 per AZ + 1 spare). Plus 15 percent for maintenance headroom = 10 or 11 nodes.
I would land on 12 nodes as a round number, knowing I will right-size after observing production.
The goal is not to be exactly right. The goal is to show structured quantitative reasoning. An interviewer scoring this would give you exceeds on capacity reasoning even if your numbers were off by 2x — because you showed units, assumptions, and safety margins explicitly.
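To make the arithmetic concrete, here is a minimal sketch of the chain as a Python helper. The function name and all of the inputs, per-pod concurrency, usable vCPU per node, and the safety factors, are the assumptions from the worked example above, not fixed rules:

```python
import math

def capacity_plan(rps, p99_s, per_pod_concurrency,
                  pod_cpu, usable_node_cpu,
                  burst=1.5, maintenance=1.15):
    """Rough pod/node estimate for a stateless service.
    Every input is an assumption to state out loud in the interview."""
    concurrent = rps * p99_s                          # requests in flight (Little's law)
    pods = math.ceil(concurrent / per_pod_concurrency)
    pods_per_node = int(usable_node_cpu / pod_cpu)    # CPU-bound bin packing
    base_nodes = math.ceil(pods / pods_per_node)
    nodes = math.ceil(base_nodes * burst)             # burst headroom
    nodes += 1                                        # N+1 spare for AZ failure
    nodes = math.ceil(nodes * maintenance)            # maintenance headroom
    return pods, base_nodes, nodes

# 20k RPS, 300ms p99, 100 concurrent per pod, 500m CPU pods, ~7 usable vCPU/node
print(capacity_plan(20_000, 0.3, 100, 0.5, 7))  # (60, 5, 11)
```

Plugging in the fintech numbers reproduces the worked example: 60 pods, 5 base nodes, and 11 nodes after safety margins, which you would round up to 12.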
What to Also Estimate
Beyond pods and nodes, strong candidates also estimate:
- Etcd object count. Each pod roughly adds 3 to 5 etcd objects (pod, endpoints, events). 10,000 pods is around 40k etcd objects — well within limits but worth naming.
- Storage. If you are running stateful workloads, estimate PV count and total TB.
- Network. East-west RPS, cross-AZ bytes per second, ingress bandwidth.
- Cost. Roll it up: X nodes at $Y per hour × 730 hours per month.
# Quick cost estimate template
monthly_cost:
  compute:
    nodes: 12
    instance_type: m6i.2xlarge
    on_demand_hourly: 0.384
    monthly: 3364        # 12 * 0.384 * 730 hours
  control_plane:
    managed_eks: 73      # $0.10/hr
  storage:
    ebs_gp3: 500         # 5TB at $0.08/GB-month
  network:
    data_transfer: 300   # rough estimate
  total_monthly: 4237    # compute + control plane + storage + network
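If you want to sanity-check the rollup out loud, the same arithmetic fits in a few lines of Python. `monthly_cost` is a hypothetical helper, and the default figures are the template's assumed values, not current AWS prices:

```python
def monthly_cost(nodes: int, hourly: float,
                 control_plane: float = 73.0,   # managed EKS at ~$0.10/hr
                 storage: float = 500.0,        # 5TB gp3 at ~$0.08/GB-month
                 network: float = 300.0) -> float:
    """Rough monthly USD estimate; every default is an assumption to restate."""
    compute = nodes * hourly * 730              # 730 hours per month
    return compute + control_plane + storage + network

print(round(monthly_cost(12, 0.384)))  # 4237
```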
Step 3: Architecture (15-20 Minutes)
This is the bulk of the interview — where you draw boxes, lines, and explain how the cluster actually works. Cover five layers, in this order.
The Five Architecture Layers
- Control plane: managed vs self-hosted, HA topology (3 or 5 nodes), etcd sizing and tuning, API server rate limits, admission controllers.
- Data plane: node pools, instance types, autoscaling (Cluster Autoscaler vs Karpenter), spot vs on-demand, kubelet tuning, OS choice.
- Networking: CNI (Calico, Cilium, AWS VPC CNI), Service types, Ingress controller, service mesh (yes/no and which), east-west and north-south policies.
- Storage: CSI driver, storage classes, PVC sizing, snapshot and backup strategy, stateful workload placement (node groups with persistent storage).
- Observability and platform: metrics (Prometheus), logs (Loki, Fluent Bit), traces, alerting, GitOps (Argo, Flux), secrets (external-secrets, Vault), policy (OPA, Kyverno).
Example Dialog
When you are at the whiteboard, walk through each layer like this:
Let me start with the control plane. For a fintech workload at this scale, I would choose managed EKS over self-hosted for two reasons: one, the team is eight people and managing control plane upgrades eats calendar; two, compliance audits are easier with a managed service. The trade-off is we give up etcd tuning, which matters if we hit control plane limits. At 10,000 pods we are nowhere near that ceiling.
Next, the data plane. I would run two node pools: a general pool of m6i.2xlarge on-demand for latency-sensitive services, and a batch pool of m6i.xlarge spot for async jobs. Karpenter for autoscaling because it beats Cluster Autoscaler at mixed-instance binpacking and we get faster node provisioning.
Networking. AWS VPC CNI with prefix delegation so we get more pods per node and stay inside VPC routing — this is usually the right default on EKS. Calico for NetworkPolicy on top since VPC CNI does not enforce policy. For ingress, AWS Load Balancer Controller with NLBs for each public-facing service. No service mesh in v1 — we can add Linkerd in year two if we need mTLS and better per-service observability, but it is not free operationally.
Always state why you did not pick the obvious alternative. "No service mesh in v1" earns more credit than silently omitting service mesh. The interviewer now knows you considered it and had a reason.
How Deep to Go on Each Layer
You have roughly 20 minutes for architecture. That is 4 minutes per layer if you split evenly — but you should not split evenly. Go deep on the layers most relevant to the prompt. For a fintech prompt, networking and observability deserve extra time. For an ML prompt, the data plane and storage layers deserve extra time.
# Depth allocation heuristic: list the layers this prompt centers on
central="networking observability"   # e.g. for a fintech prompt

for layer in control_plane data_plane networking storage observability; do
  if [[ " $central " == *" $layer "* ]]; then
    echo "$layer: central, allocate 6-8 minutes and go deep"
  else
    echo "$layer: peripheral, 2-3 minutes, name key choices, move on"
  fi
done
Step 4: Trade-Offs (10 Minutes)
At every major architectural decision, name the alternative you did not pick and explain the cost. Do this inline during Step 3, but also reserve 10 minutes at the end for explicit trade-off review.
The Trade-Off Template
For each decision, verbalize:
I chose X over Y because of Z. The cost of X is W. If Z were different, I would reconsider.
Concrete examples:
I chose managed EKS over self-hosted because the team is small. The cost is giving up etcd tuning. If we grew to 50 engineers and hit control plane limits, I would revisit.
I chose Karpenter over Cluster Autoscaler because of binpacking quality and faster provisioning. The cost is Karpenter is newer and has fewer community runbooks. If the team was not comfortable adopting newer tooling, I would pick Cluster Autoscaler.
I chose spot for the batch pool because of the roughly 70 percent cost savings. The cost is reclaim interruptions. If batch jobs had strict SLOs, I would not use spot.
Interviewers are mentally running a tally of "decisions made with explicit trade-offs" versus "decisions made silently." At L5 plus you need most decisions on the explicit side. Silent decisions read as pattern-matching, not reasoning.
Top 5 Trade-Offs to Always Cover
Even if you cover no others, these five show up in almost every K8s design:
- Managed vs self-hosted control plane (team capacity, control, cost)
- Single vs multi-cluster (blast radius, operational complexity, cost)
- Service mesh vs no mesh (observability/mTLS vs latency/ops overhead)
- Spot vs on-demand (cost vs reliability)
- GitOps vs imperative deploys (velocity vs auditability)
Step 5: Follow-Ups (5 Minutes)
Close strong by volunteering what you would want to cover if you had more time. This signals production maturity and gives the interviewer easy follow-on questions.
Categories to Mention
- Failure modes not yet discussed: "I did not walk through etcd quorum loss. When etcd loses quorum, writes fail, while reads can continue to be served from API server watch caches and grow increasingly stale."
- Edge cases: "If we lose an entire AZ at the same time as a control plane upgrade, we are below quorum for a window."
- Scale-out: "At 50,000 RPS this design starts to strain. We would shard by workload or region."
- Operational concerns: "I would want to define the on-call rotation and incident response before going live."
- Future evolution: "In year two I would reevaluate service mesh adoption and multi-region."
End with an open question back to the interviewer. "Is there a specific area you want me to go deeper on?" This gives them an easy prompt and shows you are collaborative rather than performing.
Signaling the Framework to the Interviewer
The interviewer is scoring you. To score you accurately, they need to recognize each step. Use explicit verbal markers.
Step 1 marker: "Before I design, let me ask a few clarifying questions."
Step 2 marker: "Let me put some rough numbers on this."
Step 3 marker: "I will start with the control plane, then work out to the data plane."
Step 4 marker: "Let me walk through the trade-offs on the key decisions."
Step 5 marker: "A few things I would want to cover with more time."
These phrases are the difference between a strong answer that feels structured and a strong answer that feels unstructured. The content can be identical; the scoring is not.
Do not be rigid. If the interviewer interrupts Step 2 with a Step 3 question, follow them. You can say "great question, let me answer that and then come back to capacity" — but do not insist on completing every step in order if the interviewer is pulling you somewhere else.
Worked Example: "Design K8s for a B2B SaaS"
Let us apply the framework end to end in compressed form.
Step 1: Clarify
Functional — is this a single multi-tenant app, or are we running customer workloads? Non-functional — what is the customer count, peak RPS, SLA? Constraints — cloud, budget, team, compliance?
Interviewer answers: single multi-tenant SaaS app, 500 customers, 5k RPS peak, 99.95 SLA, AWS, 5-engineer platform team, SOC2 required.
Step 2: Capacity
5k RPS × 200ms p99 = 1,000 concurrent requests. 50 concurrent per pod = 20 application pods. Plus 10 for workers, 5 for auth, 5 for ingress controllers = 40 pods. At 10 pods per node, that is 4 nodes, and with 1.5x burst and N+1 across 3 AZs I land at 9 to 10 nodes.
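The same chain, scripted with this prompt's numbers. The worker, auth, and ingress pod counts, and the one-spare-per-AZ factor, are the assumptions stated above, not derived values:

```python
import math

concurrent = 5_000 * 0.2                   # 5k RPS x 200ms p99 -> 1,000 in flight
app_pods = math.ceil(concurrent / 50)      # 50 concurrent per pod -> 20 pods
total_pods = app_pods + 10 + 5 + 5         # + workers, auth, ingress -> 40 pods
base_nodes = math.ceil(total_pods / 10)    # 10 pods per node -> 4 nodes
burst_nodes = math.ceil(base_nodes * 1.5)  # 1.5x burst headroom -> 6 nodes
nodes = burst_nodes + 3                    # one spare per AZ across 3 AZs -> 9
print(total_pods, base_nodes, nodes)       # 40 4 9
```

Nine nodes, rounded up to ten for headroom, matches the estimate above.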
Step 3: Architecture
Control plane: managed EKS. Data plane: one node pool on m6i.xlarge on-demand, Karpenter. Networking: AWS VPC CNI + Calico policies for SOC2 segmentation. Storage: gp3 EBS via CSI for the database PVC. Observability: managed Prometheus + Grafana Cloud to minimize platform team load. GitOps with Argo.
Step 4: Trade-Offs
EKS over kops — team size. Single cluster vs prod/staging split — I would actually split prod and staging into two clusters for SOC2 blast radius. No service mesh in v1 — not worth the ops overhead at this scale. Spot not used — 99.95 SLA is too tight for spot interruptions in the hot path.
Step 5: Follow-Ups
At 5k RPS we have headroom. If the customer count grew 10x in a year, I would reevaluate the tenancy model: specifically, whether we stay multi-tenant in one app or move to per-tenant namespaces. I would also want to discuss backup and restore for the database PVC, which I glossed over.
That entire answer fits in 35 minutes, touches every layer, names multiple trade-offs, and closes with follow-ups. That is a mid-L5 answer at minimum.
How to Answer in an Interview
Below is a snippet of how the framework feels in the interview room. The prompt is "Design Kubernetes for a fintech company."
Interviewer: Design Kubernetes for a fintech company.
Candidate: I am going to structure this in five steps — clarify, capacity, architecture, trade-offs, follow-ups — and check in with you between each. I will start with some clarifying questions.
Interviewer: Sounds good.
Candidate: What is the primary workload — customer-facing apps, ledger/matching engine, batch reporting, a mix? What is the peak RPS? What is the availability SLA? Are we PCI scope, SOC2, both? What cloud, and what is the team size?
Interviewer: Customer-facing mobile and web, plus a ledger service. 15k RPS peak on the app tier. 99.99 SLA on the ledger, 99.9 on the app tier. PCI DSS level 1. AWS, 12-engineer platform team.
Candidate: Thanks. Two assumptions I will make unless you correct me: single region for now with multi-AZ, and we are building greenfield. Let me move to capacity.
[Candidate does capacity math, lands on roughly 30 nodes split across two pools.]
Candidate: Good. Let me move to architecture. I will start with control plane, then data plane, networking, storage, and observability. Because this is PCI scope, I am going to lean toward stronger isolation at the networking and storage layers — I will flag that as we go.
Notice the pattern: named the framework, asked structured questions, surfaced assumptions, previewed the next step, then moved. That is the framework working as intended.
Key Concepts Summary
- The 5-step framework is Clarify, Capacity, Architecture, Trade-offs, Follow-ups — applied in order with verbal markers between steps
- Clarify covers four categories: functional, non-functional, constraints, unknowns. Ask 4 to 6 questions, not 15
- Capacity turns requirements into numbers: RPS → concurrency → pods → nodes, with explicit safety margins for burst, AZ failure, and maintenance
- Architecture covers five layers: control plane, data plane, networking, storage, observability. Go deeper on the layers most relevant to the prompt
- Trade-offs are named explicitly at every major decision — "X over Y because Z, cost is W"
- Follow-ups close the interview by volunteering failure modes, edge cases, scale-out plans, and future evolution
- Explicit verbal markers between steps let the interviewer score each step accurately — silent structure does not count
- The framework is rails, not rigid. Follow the interviewer when they probe, but always come back to the framework
Common Mistakes
- Skipping Step 2 because you feel capacity math is too slow under pressure — this is the single most costly skip in the whole framework
- Spending 10-plus minutes in Step 1 because you are uncomfortable committing to assumptions
- Going deep on every architecture layer equally instead of prioritizing the ones most relevant to the prompt
- Making trade-offs silently ("I will use Karpenter") without naming the alternative ("over Cluster Autoscaler, because...")
- Forgetting Step 5 entirely and letting the interview end on an architectural high note instead of closing with production maturity signal
- Rigidly completing every step when the interviewer is clearly trying to redirect you
- Using the framework mechanically so it sounds like recitation rather than reasoning — the rails should be invisible, the reasoning should be visible
What is Next
In the next lesson, we will turn the framework inside out and study the patterns that cause candidates to fail even when they know the framework. The anti-patterns lesson covers the specific mistakes that separate the candidate who gets the offer from the candidate who does not, with real examples of each and how to recover mid-interview when you realize you have started wrong.
Check your understanding: during Step 3 (Architecture), the interviewer interrupts and asks a detailed networking question about CNI choice. You have not yet covered storage or observability. What is the best response?