Kubernetes Performance Optimization

Kubernetes Metrics Deep Dive

You installed Prometheus and Grafana. You have 10,000 metrics. You have no idea which ones matter. Design the performance monitoring dashboard.

There is a rough parallel between LLM observability and Kubernetes observability: the platform exposes thousands of metrics, but ten of them carry most of the diagnostic weight. The other 9,990 are useful in specific investigations, useless for daily monitoring, and actively harmful if you alert on all of them.

This lesson is the short list. The metrics that actually predict whether your cluster is performing well, what each one measures, and the queries you keep on a permanent dashboard.

The problem

Three patterns I see again and again on production clusters:

  1. The vanity dashboard. Forty panels showing every metric the team thought looked interesting once. Nobody knows which ones matter. When something breaks, the dashboard provides no signal.
  2. Alert fatigue from the metrics that do not matter. Pages firing on node CPU above 80% (which is fine), pod restarts in CrashLoopBackOff (which is the application team's problem), and disk usage above 70% (which means you sized correctly). The on-call ignores them.
  3. The metric that would have caught the incident, except nobody was scraping it. etcd fsync duration when the cluster melts down. CPU throttling when the latency-sensitive service starts missing SLOs. Scheduler queue depth when the cluster cannot keep up with traffic.

The fix is the same in all three cases: a small, deliberate set of metrics tied to the four pillars from the previous lesson, with clear ownership and alerting thresholds.

KEY CONCEPT

A useful production dashboard fits on one screen and answers one question: is the cluster performing within its SLOs? Anything more and you are building documentation, not monitoring. Anything less and you are flying blind. The discipline is in deciding what to leave out.

How it works under the hood

Kubernetes metrics come from several different sources, each with its own scrape path and its own quirks. Knowing where a metric comes from helps you trust or distrust it during an incident.

[Figure: Where Kubernetes performance metrics come from. Components shown: control plane metrics, kubelet metrics, metrics-server, node exporters, and Prometheus.]

A few specific gotchas worth knowing:

  • kubectl top reads metrics-server, not Prometheus. If metrics-server is broken, kubectl top returns errors but Prometheus still works. They are different code paths.
  • kubelet exposes per-container metrics via cAdvisor at /metrics/cadvisor, not /metrics. Misconfiguring your scrape config to only hit /metrics silently drops half the data you need.
  • Control plane components on cloud-managed clusters (EKS, GKE, AKS) often expose limited or zero metrics. You get the workload-side metrics; control-plane internals are abstracted away. We cover the implications in Module 7.
  • Scrape interval matters. A 60-second scrape interval misses fast-moving phenomena like brief CPU throttling spikes. 15-30 seconds is the typical production setting.
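
One way to check for the gotchas above from inside Prometheus is to ask which expected series are missing. The queries below are a sketch of sanity checks, not a complete health check. They assume cAdvisor, kubelet, and node-exporter expose their default metric names; the job labels in your setup will differ.

```promql
# Scrape targets that are currently failing, counted per job.
count by (job) (up == 0)

# Returns a series only when no cAdvisor container metrics are being ingested,
# the usual symptom of scraping kubelet /metrics but not /metrics/cadvisor.
absent(container_cpu_usage_seconds_total)

# Returns a series only when no kubelet-level metrics are being ingested.
absent(kubelet_pod_start_duration_seconds_count)

# Returns a series only when node-exporter is not being scraped at all.
absent(node_cpu_seconds_total)
```

An empty result from each query is the healthy case. On managed clusters, expect control plane metrics to be missing by design.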

Diagnosis and measurement

The ten metrics that actually carry the weight, organized by pillar.

Pillar 1: Control plane latency

# 1. API server request duration (p99 by verb and resource)
histogram_quantile(0.99,
  sum by (le, verb, resource) (
    rate(apiserver_request_duration_seconds_bucket[5m])
  )
)

# 2. API server in-flight requests
sum by (request_kind) (apiserver_current_inflight_requests)

# 3. etcd fsync latency (p99)
histogram_quantile(0.99,
  rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])
)

API server p99 over 1 second usually means etcd is unhappy or the API server is overloaded. etcd fsync over 100ms p99 is a disk I/O problem, period. We cover etcd in detail in Module 2; for now, the metrics above tell you whether to even look there.

Pillar 2: Scheduling speed

# 4. Scheduler attempt duration p99
histogram_quantile(0.99,
  sum by (le) (
    rate(scheduler_scheduling_attempt_duration_seconds_bucket[5m])
  )
)

# 5. Scheduler pending pods (queue depth)
scheduler_pending_pods

# 6. Scheduler unschedulable pods
sum by (queue) (scheduler_pending_pods{queue="unschedulable"})

A pending-pod count that never drops means you have a real scheduling problem (capacity, affinity rules, tainted nodes). A non-zero unschedulable count specifically means the scheduler tried and failed to place those pods; check the pod events.
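
To separate "the scheduler is slow" from "the scheduler cannot place these pods", it can help to break attempts down by outcome. This is a minimal sketch. It assumes your scheduler version exposes scheduler_schedule_attempts_total with a result label (values such as scheduled, unschedulable, and error).

```promql
# Scheduling attempts per second, split by outcome.
sum by (result) (rate(scheduler_schedule_attempts_total[5m]))

# Fraction of attempts over the last 5 minutes that ended as unschedulable.
sum(rate(scheduler_schedule_attempts_total{result="unschedulable"}[5m]))
/
sum(rate(scheduler_schedule_attempts_total[5m]))
```

A rising unschedulable fraction alongside a flat attempt duration points at capacity or placement rules rather than scheduler latency.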

Pillar 3: Pod startup time

# 7. Pod startup duration p95 (kubelet first seeing the pod to the pod running)
histogram_quantile(0.95,
  sum by (le) (
    rate(kubelet_pod_start_duration_seconds_bucket[5m])
  )
)

# 8. Image pull duration p95
histogram_quantile(0.95,
  sum by (le) (
    rate(kubelet_image_pull_duration_seconds_bucket[5m])
  )
)

If pod startup is slow but image pull is fast, the issue is in container creation or app startup. If image pull is slow, it is registry, network, or image size. We cover this fully in lesson 3.2.

Pillar 4: Workload throughput

# 9. Container CPU throttling rate
sum by (namespace, pod) (
  rate(container_cpu_cfs_throttled_periods_total[5m])
)
/
sum by (namespace, pod) (
  rate(container_cpu_cfs_periods_total[5m])
)

# 10. OOMKill rate by namespace
sum by (namespace) (
  rate(container_oom_events_total[5m])
)

The throttling ratio is the single most useful container-level performance metric. A pod with 30% throttling is being held back by its CPU limit; users feel this as latency spikes that look like the application is broken. We cover this end-to-end in lesson 1.5.
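
For a dashboard panel, the ratio is usually more useful as a short list of offenders than as one line per pod. The sketch below wraps the same ratio in topk. Both the 0.05 cutoff and the limit of ten are illustrative assumptions to tune against your own baseline.

```promql
# Ten most-throttled pods, keeping only pods throttled in more than 5% of CFS periods.
topk(10,
  (
    sum by (namespace, pod) (
      rate(container_cpu_cfs_throttled_periods_total[5m])
    )
    /
    sum by (namespace, pod) (
      rate(container_cpu_cfs_periods_total[5m])
    )
  ) > 0.05
)
```

Containers without a CPU limit report no CFS periods, so they drop out of this result on their own.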

These ten queries fit on one Grafana dashboard with one panel per query. That dashboard is the single most important thing your platform team can build.

The fix

For most teams, the work to operationalize this is concrete and bounded:

  1. Confirm Prometheus is scraping all the right targets. kubelet /metrics, kubelet /metrics/cadvisor, control plane components if you operate them, node-exporter on every node. Missing scrape targets equal missing visibility.
  2. Build the dashboard. Ten panels, one per metric above: percentile lines for the latency histograms, plain series for the gauges and ratios, each with a threshold. Save it as your "Cluster Performance" dashboard.
  3. Set SLOs and alert thresholds. API server p99 < 1s, etcd fsync p99 < 100ms, pod startup p95 < 30s, throttling rate < 5%. These are starting defaults; tune them to your baseline.
  4. Page on SLO breaches, not on individual signals. "API server p99 is above 1s for 5 minutes" is an alert. "API server CPU is above 80%" is not. The first signals user impact; the second is just a number.
  5. Quarterly review. Look at the dashboard. Are the metrics still healthy? Are any new failure modes happening that the dashboard does not catch? Iterate.
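
The starting thresholds from step 3 can be written as PromQL conditions. This is a sketch of the expressions only. The "for 5 minutes" hold from step 4 would live in the alerting rule's for: field rather than in the query. The grouping labels and the verb filter are assumptions to adapt.

```promql
# API server p99 above 1s, per verb. WATCH and CONNECT are excluded on the
# assumption that long-running requests would otherwise skew the percentile.
histogram_quantile(0.99,
  sum by (le, verb) (
    rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m])
  )
) > 1

# etcd WAL fsync p99 above 100ms, per etcd member.
histogram_quantile(0.99,
  sum by (le, instance) (
    rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])
  )
) > 0.1

# Pod startup p95 above 30s, cluster-wide.
histogram_quantile(0.95,
  sum by (le) (
    rate(kubelet_pod_start_duration_seconds_bucket[5m])
  )
) > 30
```

Each expression returns series only while the threshold is breached, which is the shape an alerting rule expects.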

You do not have to start from a blank page. kube-prometheus-stack ships dozens of prebuilt Grafana dashboards covering each component, but they are too dense for daily use. Take the panels you need, drop the rest, and build your own one-pager.

WAR STORY

A team I consulted with had over 200 Grafana dashboards. Nobody could find anything. We deleted everything except a curated set of seven dashboards: one per pillar, one for nodes, one for cluster cost, one for the on-call summary. On-call response time to incidents dropped by half because people stopped hunting through dashboards looking for the relevant signal. The deleted dashboards were not wrong; they were just noise during an incident. Lesson: dashboards are products, not artifacts. They need to be designed for the user (the on-call), not the author (the engineer who thought it was cool).

Before and after

A typical "added a real performance dashboard" outcome:

Capability | Before | After
Time to identify which pillar is slow | 30+ minutes | Under 2 minutes
Alert noise per week | 100+ pages | 3-5 pages
Number of dashboards | 200+ ungroomed | 7 curated
Coverage of the four pillars | Partial | All four with thresholds
Postmortem data quality | Anecdotal | Specific p99 numbers per pillar

The dashboard does not make the cluster faster on its own. It makes the team faster at finding what to fix.

Common mistakes

  • Scraping too few sources. Missing the kubelet /metrics/cadvisor endpoint is the most common one. You lose container-level metrics and never realize it.
  • Scraping too aggressively. A 5-second scrape interval on every metric in a large cluster will eat your Prometheus instance. 15-30s is right for most workloads.
  • Alerting on capacity instead of performance. "Node CPU is at 80%" is not an incident. "API server p99 has been over 1 second for 5 minutes" is.
  • Dashboards designed for engineers, not operators. A dashboard with 40 panels of detail is a research tool, not a production monitoring surface. Build separate dashboards for each.
  • No baseline overlay. Without a "this week vs last week" comparison, you cannot tell what changed.
  • Treating kubectl top output as ground truth. It reads from metrics-server, which has its own scrape interval and aggregation. For real numbers, query Prometheus.
  • Ignoring slow-moving metrics. etcd database size, certificate expiration, secret rotation lag. These do not move quickly but they kill clusters when they cross thresholds. Add slow-moving alerts to a separate dashboard with different SLO windows.
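
Two of the items above translate directly into queries: the baseline overlay and the slow-moving signals. The sketch below assumes etcd and API server metrics are reachable, which is often not the case on managed clusters. Any thresholds you attach to these are yours to choose.

```promql
# Baseline overlay: the same API server p99 as one week ago.
# Plot it next to the current p99 to see what changed.
histogram_quantile(0.99,
  sum by (le) (
    rate(apiserver_request_duration_seconds_bucket[5m] offset 1w)
  )
)

# etcd database size as a fraction of its configured backend quota.
etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes

# 1st percentile of remaining lifetime, in seconds, of client certificates
# presented to the API server. Low values mean something expires soon.
histogram_quantile(0.01,
  sum by (le) (
    rate(apiserver_client_certificate_expiration_seconds_bucket[5m])
  )
)
```

These signals change over days, so a longer evaluation window and a ticket instead of a page are usually the better fit.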

INTERVIEW QUESTION

What are the top 10 metrics you'd monitor for Kubernetes cluster performance? Why each one?