Most Courses Teach Tools. Senior DevOps Interviews Test Architecture. Here's the Gap.

After 50+ senior DevOps interviews on both sides of the table, the same pattern keeps repeating: courses teach tools, interviews test architecture, and strong operators freeze the moment a question turns from 'what does this do' to 'design this and defend it.' This post covers the five reasoning questions senior candidates actually fail, what a knowledge answer looks like versus a reasoning answer, and how to close the gap.

By Sharon Sahadevan · 11 min read

After 50-plus senior DevOps interviews on both sides of the table, I kept noticing the same gap. It is not a knowledge gap. The candidates who fail the senior loop usually know more tools than the ones who pass. They can recite kubectl flags, name every AWS service, and have shipped real infrastructure. Then the interviewer asks them to design something, pushes back on a choice, and the answer falls apart.

The mismatch is structural: most courses teach tools, and most senior interviews test architecture. A KodeKloud subscription will teach you every kubectl flag. It will not teach you how to design a multi-tenant Kubernetes platform for 200 teams under a 45-minute clock. An A Cloud Guru course will teach you what each AWS service does. It will not teach you how to defend an architecture decision when an interviewer leans in and says "why not the other way?" Those platforms are excellent at what they do. What they do is not what the senior loop tests.

This post is the gap itself: the difference between a knowledge question and a reasoning question, the five questions I have watched strong senior candidates fail in the last year, what separates a weak answer from a strong one on each, and how to actually close the distance before your next interview.

Knowledge questions vs reasoning questions

A knowledge question has an answer you either know or you don't. What does kubectl drain do? What is the default terminationGracePeriodSeconds? Which AWS service gives you managed Kubernetes? These dominate junior and mid-level interviews, and they are exactly what tool-focused courses train you for. You can study them flat — flashcards, docs, repetition.

A reasoning question has no single answer. It has a space of answers, each with trade-offs, and the interviewer is watching how you move through that space. Design an inference serving platform for 10K requests per second. Your cluster is melting under kube-proxy load — migrate it without an outage. Defend choosing IPVS over eBPF here. There is no fact to recall. There is only the quality of your thinking made visible: do you ask about constraints before designing, do you name trade-offs without being prompted, do you change your answer gracefully when new information arrives, or do you defend the first thing you said because changing it feels like losing.

KEY CONCEPT

The senior loop is not testing what you have memorized. It is testing how you think about systems under uncertainty and pressure. Knowledge is the price of admission — you need it, but it does not differentiate you, because every other senior candidate has it too. What differentiates is reasoning: structuring an ambiguous problem, surfacing trade-offs unprompted, and defending a decision while staying open to a better one. Tool courses optimize the thing that no longer differentiates you.

The five questions senior candidates actually fail

Here are five real questions — the kind that come up at Atlassian, Netflix, Stripe, and the FAANG names — and for each, the difference between the answer that fails and the answer that passes.

1. "Design an inference serving platform: 10K req/s, p99 under 500ms, a 70B model across multiple GPU nodes."

The knowledge answer: "I'd deploy the model with vLLM behind a load balancer and put an HPA on it." Correct components, zero reasoning. It names tools and stops.

The reasoning answer starts by interrogating the constraints — is the 500ms p99 time-to-first-token or end-to-end? What's the input/output token distribution? — because the architecture changes completely depending on the answer. Then it reasons through the real levers: continuous batching to keep the GPU fed (the continuous batching post is the mechanism), KV-cache sizing and what happens when it saturates, and the autoscaling signal — explaining why you cannot scale this on CPU or GPU utilization and what you scale on instead, which is the entire point of the LLM autoscaling post. The strong candidate volunteers the cold-start problem before being asked, because they know a reactive autoscaler is 90 seconds too late for a spiky SLO. The gap between the two answers is not knowledge of vLLM. It is whether you can reason about why the obvious design misses the latency target.
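To see why the first clarifying question matters, here is a rough back-of-the-envelope sketch using Little's law (average requests in flight = arrival rate × average time in the system). The 10K req/s comes from the prompt; the 4-second full-response time in the second reading is a made-up figure for illustration only.

```python
def in_flight_requests(arrival_rate_per_s: float, seconds_in_system: float) -> float:
    """Little's law: average concurrency = arrival rate x average time in system."""
    return arrival_rate_per_s * seconds_in_system


RATE = 10_000  # req/s, taken from the interview prompt

# Reading 1: the 500 ms budget covers the whole response, end to end.
end_to_end = in_flight_requests(RATE, 0.5)

# Reading 2: 500 ms is only time-to-first-token, and a full response keeps
# streaming for about 4 s (hypothetical figure, chosen for illustration).
ttft_only = in_flight_requests(RATE, 4.0)

print(f"end-to-end reading: ~{end_to_end:,.0f} concurrent requests")
print(f"TTFT reading:       ~{ttft_only:,.0f} concurrent requests")
```

Under these assumptions the two readings differ by 8x in how many sequences the fleet must hold at once, which feeds directly into batch sizes and KV-cache capacity. That is why the strong answer asks before it designs.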

2. "Your cluster is at 5,000 Services and kube-proxy is at 100% CPU. Migrate to fix it without breaking production."

The knowledge answer: "Switch kube-proxy from iptables mode to IPVS." That is the correct destination — and a senior interviewer will immediately push back: why is iptables the problem, and how do you migrate 5,000 Services without an outage?

The reasoning answer explains the mechanism first. In iptables mode, Service matching is O(n) in the number of Services: at 5,000 Services, each new connection walks a long sequential rule chain, and kube-proxy itself burns CPU rewriting that rule set as Services and endpoints change. IPVS uses hash-based lookup instead (the kube-proxy iptables vs IPVS post is this exact question). Then the answer treats the migration as the real test: node-by-node with a canary pool, validating data-plane correctness on the IPVS nodes before draining the iptables ones, with a rollback path. The knowledge answer knows the destination. The reasoning answer owns the journey, which is the part production actually punishes.
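The scaling difference can be illustrated with a toy model. This is not kube-proxy or kernel code, and real iptables chains and IPVS tables are more involved; it only contrasts a sequential scan with a hash lookup. The function names and the generated Service addresses are hypothetical.

```python
from typing import Optional


def linear_match(rules: list, dest: str) -> int:
    """Scan rules in order, as a sequential chain would; return comparisons made."""
    comparisons = 0
    for rule in rules:
        comparisons += 1
        if rule == dest:
            break
    return comparisons


def hashed_match(table: dict, dest: str) -> Optional[str]:
    """Look the destination up directly; cost does not grow with table size."""
    return table.get(dest)


# 5,000 made-up Service addresses, standing in for the cluster in the question.
services = [f"10.96.{i // 256}.{i % 256}:80" for i in range(5_000)]
backends = {svc: f"backend-for-{svc}" for svc in services}

last = services[-1]
print("sequential comparisons for the last Service:", linear_match(services, last))
print("hash lookup result:", hashed_match(backends, last))
```

In this model the worst-placed Service costs one comparison per Service in the list, while the hash lookup is a single step regardless of how many Services exist.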

3. "Walk me through a multi-tenant Kubernetes platform: isolation, cost attribution, onboarding a new team."

The knowledge answer: "Namespaces per team, with RBAC and resource quotas." Every senior candidate says this. It is table stakes, not a differentiator.

The reasoning answer treats "isolation" as a spectrum and reasons about where on it this platform needs to sit — namespace soft multi-tenancy vs. virtual clusters vs. separate clusters, and the cost/blast-radius trade-off of each. It covers the dimensions the knowledge answer skips: network isolation (default-deny NetworkPolicies, not just RBAC), noisy-neighbor protection, who owns the node pools, how cost is attributed back to teams so the bill creates accountability, and what the onboarding workflow actually looks like as a self-service paved road rather than a ticket queue. This one rarely maps to a single tool — it maps to whether you can hold five competing concerns in your head and sequence them, which is exactly what the Kubernetes system design interview course drills.
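As one possible shape for the namespace end of that spectrum, here is a minimal sketch of what a self-service onboarding step might emit for a new team: a labeled Namespace, a ResourceQuota, and a default-deny NetworkPolicy. The function name, the team name, the quota values, and the `platform.example.com/team` label key are all hypothetical; only the Kubernetes object kinds and fields are standard.

```python
import json


def tenant_manifests(team: str, cpu: str, memory: str) -> list:
    """Build the baseline objects for one tenant namespace."""
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        # The label gives cost tooling something to group spend by (hypothetical key).
        "metadata": {"name": team, "labels": {"platform.example.com/team": team}},
    }
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "team-quota", "namespace": team},
        "spec": {"hard": {"requests.cpu": cpu, "requests.memory": memory}},
    }
    default_deny = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny", "namespace": team},
        # An empty podSelector selects every pod in the namespace; listing both
        # policy types with no rules denies all ingress and egress by default.
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }
    return [namespace, quota, default_deny]


for manifest in tenant_manifests("payments", cpu="40", memory="128Gi"):
    print(json.dumps(manifest, indent=2))
```

A default-deny egress policy also blocks DNS, so a real paved road would ship explicit allow rules alongside it. Whether this baseline is enough, or the platform needs virtual or separate clusters, is the trade-off the answer still has to argue.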

4. "Your LLM cluster is at 92% HBM, KV cache is 60% of it, and buying more GPUs is not financially viable. Design the fix."

The knowledge answer: "Add more GPU nodes." The question explicitly forbids it, and reaching for it anyway signals you did not hear the constraint — an instant senior-loop fail.

The reasoning answer works the actual levers, ordered by impact: quantization to shrink the weights and free HBM, KV-cache reduction (paged attention, prefix caching for shared prompts, shorter context windows where the product allows), and — when the cache is genuinely the wall — KV-cache offload and prefill/decode disaggregation, which is precisely the KV cache wall / Mooncake post. The reasoning candidate names the trade-off of each (quantization costs accuracy, disaggregation costs architectural complexity and network bandwidth) instead of presenting one as free. The whole question is a test of whether you can optimize within a hard constraint rather than spend your way out — which is the GPU cost optimization and LLM Operations mindset.
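To put rough numbers on the context-window lever, here is a sizing sketch. The model shape (80 layers, 8 KV heads with grouped-query attention, head dimension 128), the 2-byte cache elements, and the 40 GiB KV budget are assumptions for illustration, not figures from the question, and the budget is treated as a single pool rather than split across GPUs.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Keys and values (the factor of 2) for one token, summed over all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def max_sequences(kv_budget_bytes: int, per_token_bytes: int, context_len: int) -> int:
    """How many full-length sequences fit in the KV budget at once."""
    return kv_budget_bytes // (per_token_bytes * context_len)


# Hypothetical 70B-class shape with grouped-query attention (assumption).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
KV_BUDGET = 40 * 1024**3  # 40 GiB set aside for KV cache (assumption)

per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=2)
print("KV bytes per token:", per_token)

for context_len in (8192, 4096):
    fits = max_sequences(KV_BUDGET, per_token, context_len)
    print(f"context {context_len}: {fits} full-length sequences fit")
```

Under these assumptions, halving the maximum context doubles the number of full-length sequences that fit. The cost is a product constraint, which is the kind of trade-off the strong answer names instead of hiding.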

5. "Your GitHub Actions OIDC to AWS is failing with AccessDenied. Walk me through the eight things that could be wrong."

The knowledge answer: "Check the IAM role permissions." That is one of the eight, and the shallowest one.

The reasoning answer demonstrates a mental model of the whole trust chain — the trust policy's sub claim not matching the repo/branch/environment, the aud audience condition, the OIDC provider thumbprint, the missing id-token: write permission in the workflow, the assumed-role session policy, and so on. The GitHub Actions OIDC to AWS post walks all eight. The point of the question is not the list — it is whether you can enumerate a failure surface systematically under pressure instead of guessing one cause and hoping. Senior debugging is breadth-first, hypothesis-driven enumeration, and this question makes that visible in two minutes.
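The claim-matching part of that chain can be sketched as a small checker that reports every mismatch instead of stopping at the first guess. This is an illustration, not AWS's evaluation logic: `fnmatchcase` only approximates IAM's `StringLike` wildcards (`*` and `?`), the function name and the `example-org/example-repo` values are hypothetical, and the check covers token claims only, not the workflow's `id-token: write` permission or the role's IAM permissions.

```python
from fnmatch import fnmatchcase


def trust_mismatches(claims: dict, expected_iss: str, expected_aud: str, sub_pattern: str) -> list:
    """Return one reason per claim that would not satisfy the trust policy."""
    problems = []
    if claims.get("iss") != expected_iss:
        problems.append(f"iss is {claims.get('iss')!r}, expected {expected_iss!r}")
    if claims.get("aud") != expected_aud:
        problems.append(f"aud is {claims.get('aud')!r}, expected {expected_aud!r}")
    if not fnmatchcase(claims.get("sub", ""), sub_pattern):
        problems.append(f"sub {claims.get('sub')!r} does not match {sub_pattern!r}")
    return problems


GITHUB_ISSUER = "https://token.actions.githubusercontent.com"

# A token minted for a feature branch of a hypothetical repository.
claims = {
    "iss": GITHUB_ISSUER,
    "aud": "sts.amazonaws.com",
    "sub": "repo:example-org/example-repo:ref:refs/heads/feature-x",
}

# A trust policy that only allows the main branch.
for problem in trust_mismatches(
    claims,
    expected_iss=GITHUB_ISSUER,
    expected_aud="sts.amazonaws.com",
    sub_pattern="repo:example-org/example-repo:ref:refs/heads/main",
):
    print(problem)
```

Returning a list rather than a single boolean mirrors the breadth-first habit: collect every hypothesis that fails, then rank them.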

Why tool-focused courses can't close this gap

Notice what every strong answer above has in common: it is not a different fact than the weak answer. It is a different move. Interrogate the constraint before designing. Name the trade-off without being asked. Sequence a migration with a rollback. Enumerate a failure surface systematically. Optimize within a hard limit instead of spending past it. None of those is a tool. All of them are habits of reasoning.

Tool courses cannot teach reasoning because their format does not exercise it. A lesson that teaches you what IPVS is, even a great one, leaves you at the knowledge answer — you know the destination. The reasoning ("why is iptables O(n), how do I migrate 5,000 Services live, what's my rollback") only develops when you practice the move, repeatedly, against realistic scenarios with someone pushing back. That is a different kind of material, and most catalogs are not built to deliver it. They are built to deliver coverage, breadth, and certification rubrics — which are real and valuable goals, just not this one.

PRO TIP

You can self-diagnose which gap you have. Take any system you know well and, out loud, design it from constraints — then argue against your own first answer as if you were the interviewer. If you can generate the trade-offs and defend a revised position, you have the reasoning. If you find yourself reciting components and going quiet when pushed, you have tool knowledge without the architectural layer on top of it — and that is the exact gap the senior loop is built to find.

How to actually close it

The reasoning layer is trainable, but not by consuming more content. Three things move the needle:

  • Practice from constraints, not components. Don't start an answer by listing tools. Start by extracting and restating the constraints (throughput, latency, budget, blast radius), because the design falls out of the constraints. Interviewers read "let me make sure I understand the requirements" as a senior signal on its own.
  • Rehearse the trade-off out loud. For every choice, say the cost of the alternative unprompted: "I'd use IPVS here because iptables is O(n) at this Service count — the trade-off is IPVS adds operational complexity and a kernel-module dependency." That single habit is most of the gap.
  • Practice changing your mind gracefully. When an interviewer pushes back, the failing candidate defends the first answer because revising feels like losing. The passing candidate says "good point — if that's the constraint, I'd change X because Y." Senior engineers update on new information; the interview is testing whether you can.

This gap, from operational knowledge to architectural reasoning, is the one thing DevOpsBeast is built around. Every paid course is built from realistic scenarios, architecture design, and the actual interview questions that fall out of them, rather than tool tours. It is the companion thesis to Why I Built DevOpsBeast (and Who It's Not For), which is worth reading if you want the honest version of who the platform is and isn't for.

The honest routing

If closing this gap is what you need: the free courses — Linux, Networking, Docker, Git, Bash, and Observability — will tell you within a few minutes whether the depth matches what you're after, no email required. If they land, the paid courses (Production GPU Infrastructure, LLM Operations, Kubernetes Security, Kubernetes Performance Optimization, Identity & Trust, and the Kubernetes System Design Interview Prep) go further on the same patterns.

If they don't land, the other resources I respect serve a real purpose — KodeKloud and A Cloud Guru for tools and certs, Educative and ByteByteGo for broad system design. No shame in routing there. The best thing a specialist resource can do is be honest about which problem it solves.

The gap between knowing the tools and reasoning about the architecture is the one that decides senior interviews — and, not coincidentally, the one that decides whether Monday morning's incident goes well. Both reward the same thing: not what you have memorized, but how you think when the answer isn't given.


The realistic-scenario, architecture-reasoning format this post describes is the spine of the Kubernetes System Design Interview Prep course and the paid track behind it — Production GPU Infrastructure, LLM Operations, Kubernetes Security, and Kubernetes Performance Optimization. Related reading on the specific questions above: Your cluster has 5,000 Services and kube-proxy is the bottleneck, Your LLM cluster is at 90% HBM and 60% is KV cache, How GitHub Actions OIDC to AWS actually works (and the eight ways it breaks), and Your HPA scales LLM pods on CPU. For worked examples of these reasoning questions in full, see A Pod in Your Cluster Just Got Compromised. Walk Me Through the Blast Radius. (the security round) and You Changed the Prompt. Is the Model Better or Worse? (the MLOps round). The companion positioning piece is Why I Built DevOpsBeast (and Who It's Not For).