Kubernetes Performance Optimization

Right-Sizing Workloads with VPA and Goldilocks

You have 200 deployments. Nobody knows what the right resource requests should be. The platform team is tired of guessing. Design the right-sizing system.

Hand-tuning resource requests for one deployment is straightforward. Hand-tuning for 200 deployments owned by 30 different teams is a full-time job that nobody wants. The previous lesson laid out why right-sizing matters; this lesson is about how to actually do it at scale, where the only viable approach is to let tooling produce recommendations and let humans review the diffs.

The problem

Three failure modes I see when teams try to right-size at scale by hand:

  1. They never finish. Right-sizing 200 services manually takes weeks of engineer time. By the time you finish service 200, services 1-50 have changed and need to be re-done.
  2. They under-correct out of caution. When in doubt, they set bigger requests. That prevents OOMKills and angry app teams, but it keeps the cluster over-provisioned.
  3. They over-correct based on stale data. They look at one week of metrics and set requests from that week; then traffic grows 5x during a marketing push and pods start getting OOMKilled.

The right-sizing problem is fundamentally a continuous data analysis problem, not a one-time engineering task. The solution is a tool that watches actual resource usage continuously and produces recommendations that humans review and apply on a cadence. Vertical Pod Autoscaler (VPA) and Goldilocks are the two open-source tools that cover this.

KEY CONCEPT

Right-sizing is a process, not a project. A team that "did right-sizing once" two quarters ago has stale numbers and is back where they started. The goal is to set up the recommendation pipeline once, then let it run continuously and review diffs on a cadence (weekly, monthly, or at least quarterly).

How it works under the hood

VPA is the engine. Goldilocks is the dashboard. They work together but solve different problems.

[Interactive diagram: the right-sizing pipeline]

The reason VPA's recommendation logic works is the histogram model. Instead of just averaging usage (which would under-recommend) or taking the max (which would over-recommend), it builds a distribution of CPU and memory samples over time, gives more weight to recent samples, and produces a target near the 90th percentile of that distribution.
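
To make the histogram idea concrete, here is a minimal Python sketch of a decaying weighted percentile. It is not VPA's actual implementation; the function name `decayed_percentile`, the 24-hour half-life, and the sample data are assumptions chosen for illustration.

```python
def decayed_percentile(samples, percentile=0.90, half_life_hours=24.0):
    """Return the weighted percentile of (age_hours, value) samples.

    Each sample's weight halves every `half_life_hours`, so recent
    usage counts for more than old usage.
    """
    if not samples:
        raise ValueError("need at least one sample")
    # Sort by value; carry each sample's decayed weight alongside it.
    weighted = sorted(
        (value, 0.5 ** (age_hours / half_life_hours))
        for age_hours, value in samples
    )
    total = sum(weight for _, weight in weighted)
    running = 0.0
    for value, weight in weighted:
        running += weight
        if running >= percentile * total:
            return value
    return weighted[-1][0]


# Hypothetical CPU samples in millicores: (age in hours, usage).
# One week-old spike to 900m, then 48 hours of usage between 180m and 220m.
cpu_samples = [(168, 900)] + [(h, 180 + (h % 5) * 10) for h in range(48)]
values = [v for _, v in cpu_samples]

mean = sum(values) / len(values)
target = decayed_percentile(cpu_samples)
print(f"mean={mean:.0f}m  decayed p90={target}m  max={max(values)}m")
```

With this data the decayed percentile lands above the mean and far below the old spike, which is the behavior the paragraph above describes.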

The two important VPA modes for right-sizing:

  • updateMode: "Off": VPA produces recommendations but never modifies pods. This is what you want for production right-sizing. The engineer reads the recommendation, decides, and applies via a PR.
  • updateMode: "Auto": VPA modifies running pods to match recommendations, restarting them in the process. Useful for non-production environments and for specific workloads where automatic adjustment is acceptable.
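
In Off mode the recommendation is only written to the VPA object's status, so something has to read it. A minimal Python sketch, assuming the object has already been fetched as JSON (for example with `kubectl get vpa -o json`); the helper name `container_targets` and every value in the sample are made up for illustration.

```python
import json

# Hypothetical VPA object, trimmed to the fields used here.
vpa_json = """
{
  "metadata": {"name": "my-app-vpa", "namespace": "my-app"},
  "status": {
    "recommendation": {
      "containerRecommendations": [
        {
          "containerName": "my-app",
          "lowerBound": {"cpu": "150m", "memory": "512Mi"},
          "target": {"cpu": "300m", "memory": "1Gi"},
          "upperBound": {"cpu": "900m", "memory": "2Gi"}
        }
      ]
    }
  }
}
"""


def container_targets(vpa):
    """Map container name -> (lowerBound, target, upperBound)."""
    recs = (
        vpa.get("status", {})
        .get("recommendation", {})
        .get("containerRecommendations", [])
    )
    return {
        r["containerName"]: (r.get("lowerBound"), r["target"], r.get("upperBound"))
        for r in recs
    }


for name, (low, target, high) in container_targets(json.loads(vpa_json)).items():
    print(f"{name}: cpu={target['cpu']} memory={target['memory']} "
          f"(cpu bounds {low['cpu']}..{high['cpu']})")
```

A VPA that has not produced a recommendation yet has no `status.recommendation`, which is why the helper returns an empty mapping instead of raising.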

Goldilocks is the user-facing layer. It creates a VPA in Off mode for every Deployment in a labeled namespace, then renders the recommendations in a dashboard:

# Enable Goldilocks for a namespace
kubectl label namespace my-app goldilocks.fairwinds.com/enabled=true

# View the dashboard (assuming Goldilocks is installed)
kubectl port-forward -n goldilocks svc/goldilocks-dashboard 8080:80

The dashboard shows current vs recommended for each container in each deployment, with the gap highlighted. A team can pull up "all 50 of our deployments" and see the right-sizing diff in one view.

Diagnosis and measurement

Before installing VPA, sanity-check that you have the prerequisites and that VPA's recommendations will be meaningful:

# 1. Confirm metrics-server is running and serving data
kubectl top pods -A | head -20
# If this fails, fix metrics-server first.

# 2. Confirm pods have run long enough for VPA to have a good signal
# VPA can publish a recommendation from very little data, but the bounds stay wide for the first day or more.
# Workloads that just deployed will have lower-confidence recommendations.

# 3. Check that the workload has stable behavior to recommend against
# Pods that crash-loop or restart frequently produce noisy recommendations.

Once VPA is producing recommendations, the queries that matter:

# How much waste is each namespace currently carrying?
sum by (namespace) (
  kube_pod_container_resource_requests{resource="memory"}
)
-
sum by (namespace) (
  # Drop the pod-level series (container="") so usage is not double-counted
  container_memory_working_set_bytes{container!=""}
)

# Container-level overprovisioning ratio (CPU)
sum by (namespace, pod, container) (
  kube_pod_container_resource_requests{resource="cpu"}
)
/
sum by (namespace, pod, container) (
  rate(container_cpu_usage_seconds_total[1h])
)

Containers with a ratio above 5x are the easy wins. Target those first.
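
If you export requests and usage instead of querying them live, the same ratio is easy to compute offline. A rough Python sketch; the helper names and the three rows are hypothetical, and the parser only handles plain and millicore CPU quantities.

```python
def cpu_cores(quantity):
    """Convert a Kubernetes CPU quantity ("250m", "2", "0.5") to cores."""
    q = str(quantity)
    return float(q[:-1]) / 1000 if q.endswith("m") else float(q)


def overprovisioning_ratio(requested, used):
    """Requested CPU divided by observed CPU usage (usage must be non-zero)."""
    return cpu_cores(requested) / cpu_cores(used)


# Hypothetical rows: (container, CPU request, observed CPU usage)
rows = [
    ("checkout/api", "2", "200m"),
    ("checkout/worker", "500m", "400m"),
    ("search/indexer", "4", "350m"),
]

ratios = [(name, overprovisioning_ratio(req, used)) for name, req, used in rows]
# Keep only the easy wins (above 5x), worst offenders first.
easy_wins = sorted(
    (item for item in ratios if item[1] > 5),
    key=lambda item: item[1],
    reverse=True,
)
for name, ratio in easy_wins:
    print(f"{name}: {ratio:.1f}x")
```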

The fix

A practical right-sizing rollout plan for a 200-deployment cluster:

Week 1: Set up the pipeline.

  • Install VPA (recommender component only; you do not need the updater for this workflow)
  • Install Goldilocks
  • Label staging namespaces with goldilocks.fairwinds.com/enabled=true first

Week 2-3: Validate against staging.

  • Wait for VPA to accumulate at least 7 days of data
  • Review Goldilocks recommendations for staging deployments
  • Apply 5-10 high-impact recommendations via PR
  • Confirm no regressions; the workloads should run as well or better with smaller requests

Week 4 onwards: Roll to production.

  • Label production namespaces, one team at a time
  • Establish a quarterly right-sizing review cadence
  • Build a runbook: "When Goldilocks shows a 3x+ overprovisioning gap, open a PR; for under-provisioning, escalate to the team to investigate why"
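
The runbook rule is simple enough to encode, which helps keep reviews consistent across teams. A minimal Python sketch; the function name `triage` and the returned strings are assumptions, and only the 3x threshold comes from the runbook text above.

```python
def triage(requested, recommended, overprovision_threshold=3.0):
    """Return the runbook action for one container resource.

    `requested` and `recommended` must be in the same unit
    (for example CPU cores or bytes of memory).
    """
    if requested <= 0 or recommended <= 0:
        raise ValueError("values must be positive")
    if recommended > requested:
        # Under-provisioned: do not auto-fix, find out why first.
        return "escalate: ask the owning team to investigate"
    if requested / recommended >= overprovision_threshold:
        return "open PR: lower the request to the recommendation"
    return "no action"


print(triage(requested=2.0, recommended=0.3))   # large over-provisioning gap
print(triage(requested=0.5, recommended=0.8))   # under-provisioned
print(triage(requested=1.0, recommended=0.5))   # gap below the threshold
```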

The spec for an explicit recommendation-only VPA on a Deployment:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
  namespace: my-app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"   # recommendations only; no automatic updates
  resourcePolicy:
    containerPolicies:
      - containerName: '*'
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi
        controlledResources: ["cpu", "memory"]

The minAllowed and maxAllowed bounds prevent absurd recommendations (like 10m CPU for a workload that briefly spikes during garbage collection).
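
Conceptually, the bounds act as a clamp on whatever the recommender computes. A small Python illustration using the limits from the spec above; the raw recommendation values are hypothetical.

```python
def clamp(value, min_allowed, max_allowed):
    """Keep a recommendation inside [min_allowed, max_allowed]."""
    return max(min_allowed, min(value, max_allowed))


# Bounds from the spec above, expressed in millicores and Mi.
MIN_CPU_M, MAX_CPU_M = 50, 4000
MIN_MEM_MI, MAX_MEM_MI = 64, 8192

# Hypothetical raw recommendations before bounds are applied.
print(clamp(10, MIN_CPU_M, MAX_CPU_M))        # too low: raised to the floor
print(clamp(300, MIN_CPU_M, MAX_CPU_M))       # in range: unchanged
print(clamp(12000, MIN_MEM_MI, MAX_MEM_MI))   # too high: capped at the ceiling
```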

WAR STORY

A team I worked with had a backend service with requests: {cpu: 2, memory: 4Gi} set "to be safe" two years ago. Actual p95 usage was 200m CPU and 600Mi memory. With 30 replicas, that was 60 cores and 120Gi reserved across the cluster against p95 usage of about 6 cores and 18Gi, so roughly 54 cores and 102Gi sat idle. Goldilocks recommended requests: {cpu: 300m, memory: 1Gi}. We applied it. The service ran better (more efficient bin-packing reduced cross-node networking calls). The cluster freed up about 51 cores and 90Gi for other workloads. This is one service. We had 80 like it. Lesson: most over-provisioning is invisible until a tool surfaces it.
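
The arithmetic behind this story, as a quick Python check. It uses only the per-replica requests, usage, and recommendation quoted above, in millicores and Mi so everything stays in integers.

```python
replicas = 30

old_cpu_m, old_mem_mi = 2000, 4096    # original requests per replica
used_cpu_m, used_mem_mi = 200, 600    # p95 usage per replica
new_cpu_m, new_mem_mi = 300, 1024     # recommended requests per replica

reserved_cpu_m = replicas * old_cpu_m
reserved_mem_mi = replicas * old_mem_mi
used_total_cpu_m = replicas * used_cpu_m
used_total_mem_mi = replicas * used_mem_mi

# Capacity returned to the cluster once the recommendation is applied.
freed_cpu_m = replicas * (old_cpu_m - new_cpu_m)
freed_mem_mi = replicas * (old_mem_mi - new_mem_mi)

print(f"reserved: {reserved_cpu_m / 1000:.0f} cores, {reserved_mem_mi / 1024:.0f}Gi")
print(f"used:     {used_total_cpu_m / 1000:.0f} cores, {used_total_mem_mi / 1024:.1f}Gi")
print(f"freed:    {freed_cpu_m / 1000:.0f} cores, {freed_mem_mi / 1024:.0f}Gi")
```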

Before and after

A typical 6-month VPA + Goldilocks rollout outcome:

Metric                                Before       After
Cluster CPU allocated                 1,800 cores  720 cores
Cluster CPU actual usage              280 cores    320 cores
Allocation efficiency                 16%          44%
Number of right-sized deployments     unknown      180 of 200
Average overprovisioning ratio        6.4x         2.2x
Pod restarts from under-provisioning  0/day        1-2/day (acceptable; investigated case by case)
Monthly cluster cost                  $48K         $26K
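
The efficiency and overprovisioning rows follow directly from the allocated and actual-usage rows. A short Python check of that derivation, using only the numbers in the table; the helper names are illustrative.

```python
def allocation_efficiency(used_cores, allocated_cores):
    """Share of allocated CPU that is actually used."""
    return used_cores / allocated_cores


def overprovisioning(used_cores, allocated_cores):
    """How many times larger the allocation is than real usage."""
    return allocated_cores / used_cores


before = {"allocated": 1800, "used": 280}
after = {"allocated": 720, "used": 320}

for label, row in (("before", before), ("after", after)):
    eff = allocation_efficiency(row["used"], row["allocated"])
    ratio = overprovisioning(row["used"], row["allocated"])
    print(f"{label}: efficiency={eff:.0%}  overprovisioning={ratio:.2f}x")
```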

The 1-2 daily restarts after right-sizing are not a bug; they are the system finding the small number of workloads that genuinely need more resources than VPA recommended. Each one is a learning moment that gets applied back to the recommendation pipeline.

Common mistakes

  • Running VPA in Auto mode in production. VPA Auto restarts pods to apply new requests. For production workloads, this is too aggressive. Stay in Off mode and apply via PR.
  • Acting on early recommendations. VPA needs at least 7 days of data to produce stable recommendations. Acting on a 24-hour signal is acting on noise.
  • Right-sizing without bounds. A workload with brief, real spikes (GC pauses, batch jobs) produces low average usage but needs headroom. minAllowed prevents the recommendation from going below safe levels.
  • Skipping the staging validation phase. Roll out Goldilocks to staging first, build confidence, then apply to production.
  • Treating right-sizing as a one-time project. Workloads change. Traffic patterns evolve. Right-sizing without a quarterly review cadence is right-sizing once and decaying back to over-provisioned within a year.
  • Applying recommendations without app-team review. The platform team should propose; the app team should approve. They know things about their workload that the recommender does not.
  • Ignoring under-provisioning recommendations. When Goldilocks says "you should request more," it usually means the workload is being throttled or coming close to OOMKill. Take it as seriously as the over-provisioning recommendations.

INTERVIEW QUESTION

How would you right-size resource requests for 500 deployments across a cluster without manually profiling each one?