Your HPA Scales LLM Pods on CPU. They're Either Idle or On Fire.

The default Kubernetes autoscaler watches CPU. Your GPU sits at 100% no matter what. So your inference fleet either never scales or scales 90 seconds too late, after the cold start, after the SLO already broke. This post covers the signals that actually predict load, the KEDA wiring, and the cold-start tax that makes reactive scaling a trap.

By Sharon Sahadevan · 15 min read

Traffic spikes 3x at 9am like it does every weekday. Your LLM inference deployment is supposed to scale up to absorb it. You open the dashboard during the next spike to watch it work.

It does not work. The HPA sits at its minimum replica count through the entire ramp. GPU utilization reads 100% — it always reads 100% — so by the only signal the autoscaler can see, nothing is wrong. Meanwhile request queue depth climbs, p99 time-to-first-token blows past your 2-second SLO, and vLLM starts preempting in-flight sequences to make room. Ten minutes later, after the spike has already burned through your error budget, a pod finally scales up. It spends 90 seconds loading model weights before it serves a single token. By the time it is ready, the spike is over.

You have built an autoscaler that reacts to the wrong signal, too slowly, after the damage is done. This is the default state of LLM autoscaling on Kubernetes, and almost every team ships it before they understand why it cannot work.

This post covers the three things you have to get right: the signal you scale on, the cold-start tax that makes reactive scaling a trap, and the two control loops (pod and node) that move at completely different speeds.

Why CPU and GPU utilization are both the wrong signal

The Horizontal Pod Autoscaler ships watching CPU. For a stateless web service that is a reasonable proxy for load: more requests, more CPU, scale out. For LLM inference it is nonsense. The CPU on an inference pod mostly tokenizes input and shuffles bytes; the GPU does the work. A pod can be drowning in queued requests while its CPU sits at 15%. Scale on CPU and you will never scale at the right time.

So teams reach for the obvious fix: scale on GPU utilization instead. This is worse, because it looks right. As the DCGM observability post lays out in detail, DCGM_FI_DEV_GPU_UTIL reports whether a kernel was running, not how much work the GPU is doing. Autoregressive decode keeps a stream of small kernels resident, so GPU utilization pins near 100% under almost any steady load — one request or a thousand. It is saturated at light load and saturated at overload. A signal that reads 100% in both the state where you should scale up and the state where you should not is not a control signal at all. It is a constant.
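To see why a pinned metric cannot steer anything, here is a minimal sketch of the replica rule documented for the HPA, desired = ceil(current × metric / target), including its default 10% tolerance. The 60% CPU target, the replica bounds, and the function name are illustrative assumptions, not values from a real cluster.

```python
import math

def hpa_desired_replicas(current, metric, target, min_replicas, max_replicas, tolerance=0.1):
    """Core HPA rule: desired = ceil(current * metric / target), clamped to bounds."""
    ratio = metric / target
    if abs(ratio - 1.0) <= tolerance:
        desired = current  # close enough to target: no change
    else:
        desired = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, desired))

# CPU at 15% against an assumed 60% target: the fleet stays on its floor,
# no matter how many requests are queued behind the GPU.
print("cpu:", hpa_desired_replicas(2, 15, 60, 2, 12))

# GPU utilization pinned at 100% against an 80% target: the recommendation
# is identical at light load and at overload.
for load in ("light load", "overload"):
    print(load, "->", hpa_desired_replicas(4, 100, 80, 2, 12))
```

Both GPU rows print the same recommendation because the input never changes with load. Whatever the fleet does, it is not responding to demand.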

KEY CONCEPT

LLM serving has no useful hardware utilization signal for autoscaling. CPU is decoupled from the real work; GPU utilization is pinned at 100% and tells you nothing. The signals that actually track load live one layer up, in the inference server itself: how many requests are queued, how full the KV cache is, and how long the first token is taking. Scale on the serving layer, never on the silicon.

The signals that actually track load

Every serious inference server (vLLM, SGLang, TGI) exposes a Prometheus /metrics endpoint, and the metrics on it are what you scale on. Using vLLM's names, the ones that matter:

vllm:num_requests_waiting       # requests queued, not yet running — the headline signal
vllm:num_requests_running       # requests in the active batch right now
vllm:gpu_cache_usage_perc       # fraction of the KV cache block pool in use (0..1)
vllm:time_to_first_token_seconds  # TTFT histogram — your latency SLO, measured
vllm:request_queue_time_seconds   # how long requests wait before prefill starts
vllm:e2e_request_latency_seconds  # end-to-end latency histogram

Each answers a different scaling question:

  • num_requests_waiting is the most direct load signal there is. If requests are queuing, you do not have enough capacity, full stop. It rises before latency degrades, which is exactly what you want from a scaling trigger — it leads, it does not lag.
  • gpu_cache_usage_perc is the one specific to LLM serving, and it is the one most teams miss. The KV cache is a fixed-size block pool (see Tuning vLLM gpu_memory_utilization for how its size is set). When it fills, vLLM cannot admit new sequences without preempting running ones — recomputing or swapping their KV state, which causes a brutal latency cliff. You want to scale up before the cache saturates, so a target of around 0.7–0.8 on this metric gives you runway to add a pod before the cliff. This is the leading indicator the KV cache wall post is about, used as a control signal.
  • time_to_first_token is the signal to scale on when you have a hard latency SLO. If your product promises sub-2-second first token, scale on the p95 of this histogram directly. It is the most user-facing signal but also the most lagging — by the time TTFT degrades, users already felt it. Use it as a backstop, not the primary trigger.
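Scaling or alerting on "the p95 of this histogram" means estimating a quantile from cumulative buckets. The sketch below mirrors the linear interpolation that PromQL's histogram_quantile applies. The bucket bounds and counts are made-up sample data, not real vLLM output.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by bound
    and ending with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                return lower_bound  # rank falls in +Inf: report the last finite bound
            span = count - lower_count
            # Linear interpolation inside the bucket that contains the rank
            return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / span
        lower_bound, lower_count = upper_bound, count
    return lower_bound

# Hypothetical TTFT buckets: (le, cumulative request count)
ttft_buckets = [(0.5, 600), (1.0, 900), (2.0, 980), (5.0, 1000), (math.inf, 1000)]
p95 = histogram_quantile(0.95, ttft_buckets)
print(f"p95 TTFT ~ {p95:.3f}s, SLO breached: {p95 > 2.0}")
```

With this sample data the p95 is still under a 2-second SLO even though 2% of requests already exceeded it, which is one reason TTFT works better as a backstop than as the primary trigger.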

The hierarchy in practice: scale primarily on num_requests_waiting or gpu_cache_usage_perc (both lead), and alert/backstop on TTFT (which lags). The whole point is to act on the signal that moves first.
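As a rough illustration of reading these signals off a /metrics endpoint, the sketch below parses Prometheus text exposition with the standard library and checks the two leading signals. The sample payload, the model_name label, and the thresholds (5 queued requests, 0.75 cache usage) are assumptions for the example. It also assumes samples carry no trailing timestamp.

```python
SAMPLE = '''\
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="llama"} 7.0
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="llama"} 32.0
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="llama"} 0.81
'''

def parse_metrics(text):
    """Return {metric_name: [values]} from Prometheus text exposition."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, value = line.rsplit(" ", 1)
        name = series.split("{", 1)[0]  # drop the label set
        samples.setdefault(name, []).append(float(value))
    return samples

def scale_up_reasons(samples, queue_target=5.0, kv_target=0.75):
    """List which leading signals are over their (assumed) targets."""
    reasons = []
    if sum(samples.get("vllm:num_requests_waiting", [])) > queue_target:
        reasons.append("queue")
    kv = samples.get("vllm:gpu_cache_usage_perc", [])
    if kv and max(kv) > kv_target:
        reasons.append("kv_cache")
    return reasons

samples = parse_metrics(SAMPLE)
print(scale_up_reasons(samples))
```

In production the same comparison happens inside Prometheus and the autoscaler. The point of the sketch is only that both leading signals are plain gauges you can read and threshold directly.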

Wiring it up: KEDA over raw HPA

The HPA cannot read vLLM's Prometheus metrics by itself. You need one of two bridges:

  1. Prometheus Adapter — exposes Prometheus queries as Kubernetes custom/external metrics that a stock HPA can consume. More moving parts, more YAML, but it is "just an HPA."
  2. KEDA — an event-driven autoscaler that wraps the HPA and speaks Prometheus natively. Less boilerplate, supports scale-to-zero, and is the production default for this in 2026.

KEDA scaling a vLLM deployment on queue depth, with KV-cache pressure as a second trigger:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-llama-scaler
  namespace: inference
spec:
  scaleTargetRef:
    name: vllm-llama        # the Deployment
  minReplicaCount: 2        # never below 2 — headroom, see cold-start section
  maxReplicaCount: 12
  cooldownPeriod: 300       # KEDA applies this only to the final scale to zero; with minReplicaCount 2, scale-down is governed by the behavior block below
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 30   # react fast on the way up
        scaleDown:
          stabilizationWindowSeconds: 600  # react slowly on the way down
          policies:
          - type: Pods
            value: 1
            periodSeconds: 120             # remove at most 1 pod / 2 min
  triggers:
  - type: prometheus
    metricType: Value       # the query is already per-replica; KEDA's default AverageValue would divide by replica count again
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      metricName: vllm_requests_waiting
      # Average queued requests per replica across the deployment
      query: |
        sum(vllm:num_requests_waiting{deployment="vllm-llama"})
          / count(vllm:num_requests_running{deployment="vllm-llama"})
      threshold: "5"        # target ~5 queued per replica; above it, add pods
  - type: prometheus
    metricType: Value       # compare the fleet-average ratio to the target directly
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      metricName: vllm_kv_cache_pressure
      query: |
        avg(vllm:gpu_cache_usage_perc{deployment="vllm-llama"})
      threshold: "0.75"     # scale before the KV cache saturates and preempts

The asymmetric scaleUp/scaleDown behavior is the most important part and the part most configs get wrong. Scaling up should be fast and eager — the cost of being one pod short is a broken SLO. Scaling down should be slow and reluctant — the cost of being one pod over is a few dollars of GPU time, but the cost of scaling down into a spike you misread is another full cold start to recover. On GPU workloads, asymmetry is not a tuning nicety; it is the whole strategy. Cheap to keep a pod, expensive to recreate one.
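The effect of a scale-down stabilization window is easier to see in a toy simulation. The sketch below is a simplification, not the HPA controller: it scales up immediately and scales down only as far as the highest recommendation seen within the window. The window is counted in ticks rather than seconds, rate-limiting policies are ignored, and the recommendation series is invented.

```python
from collections import deque

def stabilized_replicas(recommendations, window, start):
    """Apply a scale-down stabilization window to raw recommendations."""
    history = deque(maxlen=window)  # recent raw recommendations
    current = start
    applied = []
    for rec in recommendations:
        history.append(rec)
        if rec >= current:
            current = rec  # scale up right away
        else:
            # Scale down only to the highest recommendation in the window
            current = min(current, max(history))
        applied.append(current)
    return applied

def scale_up_events(replicas):
    """Each upward step is a pod that has to pay a cold start."""
    return sum(1 for prev, cur in zip(replicas, replicas[1:]) if cur > prev)

raw = [6, 6, 3, 3, 6, 3, 3, 3, 3]  # a brief lull, a second spike, then real calm
eager = stabilized_replicas(raw, window=1, start=6)
patient = stabilized_replicas(raw, window=4, start=6)
print(eager, scale_up_events(eager))
print(patient, scale_up_events(patient))
```

With no window, the lull removes pods that the next spike has to recreate. With a window of four ticks, the fleet rides through the lull and only shrinks once the calm has lasted.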

The cold-start tax: why reactive scaling is a trap

Here is the thing that makes LLM autoscaling fundamentally different from web autoscaling, and the thing that breaks every "just scale on the right metric" plan: a new inference pod is not ready when it starts. It is ready a minute or two later. The startup sequence:

  1. Pull the container image. Inference images are huge — CUDA runtime, PyTorch, the serving framework. Tens of gigabytes if you are careless. On a cold node, minutes.
  2. Fetch the model weights. A 70B model in FP16 is ~140GB. Even from fast object storage at 10 Gbps that is roughly two minutes of pure transfer.
  3. Load weights into HBM and initialize. Copy to GPU, build CUDA graphs, run warmup passes.

The vLLM tuning post opens on exactly this: "the pod takes 90 seconds to start." That 90 seconds, and it is often more, is dead time during which the pod occupies a GPU but serves nothing. If your autoscaler is purely reactive (wait for the queue to build, then add a pod), you have guaranteed that every spike is met at least 90 seconds late. For a 5-minute spike, that is 30% of the spike, nearly the first third, and it is the part where the SLO breaks.
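The numbers in this section are simple enough to check by hand. The sketch below reproduces them, assuming 2 bytes per parameter for FP16, decimal gigabytes, and a link that sustains its full 10 Gbps. The function names are illustrative.

```python
def weights_gb(params_billion, bytes_per_param):
    """Model size in GB: billions of parameters times bytes per parameter."""
    return params_billion * bytes_per_param

def transfer_seconds(size_gb, link_gbps):
    """Pure transfer time: gigabytes to gigabits, divided by link speed."""
    return size_gb * 8 / link_gbps

size = weights_gb(70, 2)              # 70B parameters in FP16
fetch = transfer_seconds(size, 10)    # from storage at 10 Gbps
missed = 90 / (5 * 60)                # 90 s cold start against a 5-minute spike

print(f"{size} GB of weights, {fetch:.0f} s to fetch, {missed:.0%} of the spike missed")
```

Real transfers rarely sustain line rate, so treat the fetch time as a lower bound.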

Reactive autoscaling assumes new capacity is available roughly when you ask for it. For LLM inference that assumption is false, and no choice of metric fixes it. The fixes are structural:

PRO TIP

You cannot make the cold start instant, so stop trying to scale exactly to demand. Run with deliberate headroom (minReplicaCount and a target utilization that leaves a pod's worth of slack), so the buffer absorbs the spike while the new pod boots. The autoscaler's job is not to catch the spike — the buffer catches the spike — it is to replace the buffer you just spent before the next one. Autoscaling LLM serving is buffer management, not demand tracking.
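A back-of-the-envelope way to reason about the buffer: during the cold-start window, any demand above current capacity simply queues. The sketch below assumes a constant per-pod throughput and a step change in demand. All the numbers (2 requests per second per pod, a jump from 4 to 6 requests per second) are hypothetical.

```python
def backlog_during_cold_start(demand_rps, replicas, per_pod_rps, cold_start_s):
    """Requests that pile up while a replacement pod is still booting."""
    shortfall = max(0.0, demand_rps - replicas * per_pod_rps)
    return shortfall * cold_start_s

# Demand steps from 4 to 6 req/s; each pod sustains an assumed 2 req/s.
exact_fit = backlog_during_cold_start(6.0, replicas=2, per_pod_rps=2.0, cold_start_s=90)
one_spare = backlog_during_cold_start(6.0, replicas=3, per_pod_rps=2.0, cold_start_s=90)

print(f"sized exactly to old demand: {exact_fit:.0f} requests queued")
print(f"with one pod of headroom:    {one_spare:.0f} requests queued")
```

The spare pod does not remove the need to scale. It buys the 90 seconds the next pod needs to become useful.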

The levers that shrink or hide the cold start, in rough ROI order:

  • Get the weights off the critical path. Do not bake them into the image and do not pull them from a slow bucket on every start. Cache them on a node-local NVMe hostPath, a regional read-many PVC, or a pre-warmed cache so a new pod loads from local disk, not the network. This is usually the single biggest win.
  • Slim the image. Keep weights out of it; keep build tooling out of it. A smaller image pulls faster, especially on a freshly scaled-up node.
  • Run a warm buffer. Set minReplicaCount and scaling targets so there is always ~1 idle pod of headroom. You pay for one extra GPU; you buy the cold start out of the user-facing path.
  • Scale predictively for known patterns. If load is diurnal — and inference load almost always is — do not wait for the morning spike to discover it. KEDA's cron scaler raises the floor before the ramp, so the pods are already warm when traffic arrives.
  # Add to triggers: pre-warm before the weekday morning ramp
  - type: cron
    metadata:
      timezone: America/New_York
      start: "45 8 * * 1-5"   # 08:45, raise the floor before the 09:00 spike
      end: "0 18 * * 1-5"     # hold until 18:00
      desiredReplicas: "6"

Predictive + reactive together is the production pattern: cron sets the floor for the load you can predict, queue-depth handles the surprises on top.
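When several triggers are active, the HPA acts on the largest replica count any of them asks for, which is what lets a cron floor and a queue-depth trigger coexist. The sketch below models that combination. It is a simplification and not KEDA's implementation: `now` is assumed to already be in the scaler's timezone, and the reactive recommendation is passed in as a plain number.

```python
from datetime import datetime, time

def cron_floor(now, floor=6, start=time(8, 45), end=time(18, 0)):
    """Replica floor for the weekday window 08:45 to 18:00, else 0."""
    in_window = now.weekday() < 5 and start <= now.time() < end
    return floor if in_window else 0

def desired_replicas(now, reactive, min_replicas=2, max_replicas=12):
    """Largest request wins, clamped to the ScaledObject bounds."""
    return min(max_replicas, max(min_replicas, reactive, cron_floor(now)))

monday_prewarm = desired_replicas(datetime(2025, 1, 6, 8, 50), reactive=3)
monday_early = desired_replicas(datetime(2025, 1, 6, 8, 30), reactive=3)
monday_surprise = desired_replicas(datetime(2025, 1, 6, 10, 0), reactive=9)
saturday = desired_replicas(datetime(2025, 1, 4, 10, 0), reactive=1)

print(monday_prewarm, monday_early, monday_surprise, saturday)
```

The cron floor wins before the ramp, the reactive trigger wins when a surprise exceeds the floor, and the minimum replica count holds on the weekend.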

The two control loops nobody reconciles

Even with the right metric and a warm buffer, there is a second trap: pods and nodes autoscale on completely different timescales, and people forget the node loop exists.

  • Pod loop (HPA/KEDA): add a replica. Fast if there is a GPU node with a free GPU to schedule it on — seconds to a minute.
  • Node loop (Cluster Autoscaler / Karpenter): if there is no free GPU, you must provision a new GPU node. That means a cloud API call, instance boot, NVIDIA driver and device-plugin install, and node-ready. Two to ten minutes, sometimes more for scarce GPU SKUs.

If your pod autoscaler asks for a replica and no GPU is free, the pod sits Pending while the node loop grinds — and then the new pod still pays its own cold start on top. Worst case you stack a 5-minute node provision and a 2-minute model load: seven minutes from "queue building" to "serving traffic." No SLO survives that reactively.

The reconciliation is to keep the slow loop ahead of the fast one. Maintain a small buffer of warm GPU nodes (or use Karpenter with provisioning headroom / low-priority placeholder pods that get preempted when real work arrives) so the pod loop almost always finds a GPU waiting. You are pre-paying for idle GPU capacity to convert a 7-minute reaction into a 1-minute one. On expensive accelerators that tradeoff has to be deliberate and costed — which is squarely the GPU cost optimization question — but pretending the node loop is instant is how you end up with pods stuck Pending through your busiest hour.
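The stacking of the two loops is plain addition, sketched here with the figures from the text (a 5-minute node provision and a 2-minute model load). The warm-node case assumes a cold start of about one minute, which is an assumption for illustration.

```python
def seconds_to_serving(gpu_free, node_provision_s, cold_start_s):
    """Time from 'replica requested' to 'serving traffic'."""
    wait_for_node = 0 if gpu_free else node_provision_s
    return wait_for_node + cold_start_s

worst_case = seconds_to_serving(gpu_free=False, node_provision_s=300, cold_start_s=120)
warm_node = seconds_to_serving(gpu_free=True, node_provision_s=300, cold_start_s=60)

print(f"no free GPU: {worst_case / 60:.0f} min, warm node: {warm_node / 60:.0f} min")
```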

Scale-to-zero: yes for dev, no for prod

KEDA can scale a deployment to zero replicas when idle, which is enormously tempting for GPU workloads — an idle H100 is ~$40K/year of nothing. For internal tools, dev environments, and rarely-hit endpoints, scale-to-zero is the right call: the first request after idle eats the full cold start, but nobody is paged over a 90-second wait on an internal tool, and the savings are real.

For production endpoints with a latency SLO, scale-to-zero is a trap. The first user after a quiet period pays the entire cold-start tax — image pull, weight load, warmup — and sees a multi-minute first token. You will have built a system that is fastest exactly when nobody is using it and slowest exactly when the first real user shows up. Keep minReplicaCount at 1 or more for anything user-facing. The "scale to zero saved us money" win is real only where "the first request is slow" is acceptable.

WAR STORY

A team I worked with autoscaled their chat inference fleet on DCGM_FI_DEV_GPU_UTIL with a target of 80%. It never scaled in response to load, because GPU utilization never dropped below 80%: it was pinned near 100% at every load level. They assumed the cluster was simply always busy and kept buying GPUs. The real problem surfaced when a marketing push tripled traffic in an hour: the fleet didn't add a single pod, the KV cache saturated, vLLM started preempting sequences, and p99 TTFT went from 1.2s to 19s. The fix was three changes: scale on vllm:num_requests_waiting and vllm:gpu_cache_usage_perc instead of GPU util, set an asymmetric behavior block (fast up, slow down), and raise minReplicaCount so a warm buffer absorbed spikes while new pods loaded. They also moved model weights from S3 to a node-local cache, cutting cold start from 110s to 35s. Same hardware budget, p99 back under SLO, and, because they could finally scale down safely on the slow side, their off-peak GPU bill dropped 30%. The autoscaler had been a no-op wired to a constant the whole time.

Common mistakes

Scaling on CPU. The default HPA metric. Decoupled from GPU work. It will never fire at the right time for inference.

Scaling on GPU utilization. Looks correct, is useless — pinned at 100% across all load levels. The single most common LLM autoscaling mistake, covered in depth in the DCGM post.

Symmetric scale-up/scale-down behavior. Scaling down as fast as you scale up means a brief lull triggers a scale-down, and the next spike pays a full cold start to recover. Always slow the scale-down with a long scaleDown stabilizationWindowSeconds and a conservative removal policy. KEDA's cooldownPeriod does not help here: it only delays the final step to zero replicas.

Forgetting the node loop. Pod autoscaling is meaningless if there is no GPU to place the pod on. Without a warm node buffer or provisioning headroom, scale-up requests stall on Pending for minutes.

Reactive-only scaling on predictable load. Inference load is diurnal. If you can predict the morning spike, pre-warm with a cron floor instead of discovering it reactively every single day.

Scale-to-zero on a latency-SLO endpoint. Makes the system slowest for the first real user. Fine for dev, wrong for production.

Thresholds that flap. Scaling on a noisy raw metric without averaging or a stabilization window makes the fleet thrash — and on GPUs, thrash means repeated cold starts. Smooth the signal and damp the response.

Ignoring KV-cache pressure. Scaling only on queue depth misses the workload where long contexts saturate the KV cache before the queue builds. Add gpu_cache_usage_perc as a trigger so you scale ahead of preemption.
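On the flapping mistake above, a small sketch shows what smoothing buys. It compares how often a raw queue-depth series crosses a threshold with how often an exponentially weighted moving average of the same series does. The series, the threshold of 5, and the smoothing factor of 0.2 are invented for the example. In Prometheus the usual equivalent is averaging over a time window in the query.

```python
def ewma(values, alpha=0.2):
    """Exponentially weighted moving average of a series."""
    smoothed, out = None, []
    for v in values:
        smoothed = v if smoothed is None else alpha * v + (1 - alpha) * smoothed
        out.append(smoothed)
    return out

def upward_crossings(values, threshold):
    """Count how many times the series rises above the threshold."""
    above = [v > threshold for v in values]
    return sum(1 for prev, cur in zip(above, above[1:]) if cur and not prev)

raw_queue = [2, 8, 1, 7, 2, 8, 1, 7]  # noisy samples, mean below the threshold
print("raw crossings:     ", upward_crossings(raw_queue, 5))
print("smoothed crossings:", upward_crossings(ewma(raw_queue), 5))
```

Each upward crossing of the raw series is a potential scale-up, and on GPUs each one of those is a cold start the smoothed signal would have avoided.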

The mental model

Web autoscaling is demand tracking: capacity is roughly instant, so you follow load up and down in near-real-time and the only question is what metric proxies load. LLM autoscaling is demand buffering: capacity takes minutes to come online, so you cannot follow load — you hold a buffer that absorbs spikes, and the autoscaler's real job is keeping that buffer stocked. Get that framing right and every decision follows: scale on a leading serving-layer signal so you start refilling early, keep a warm pod and warm node buffer so the spike hits the buffer and not the user, scale down slowly so you never destroy a buffer you are about to need, and pre-warm the load you can predict.

The three numbers that run a good LLM autoscaler are queued requests, KV-cache utilization, and time-to-first-token — none of which is a hardware metric, all of which lead or measure the user experience directly. The GPU's own utilization counter, the one every dashboard graphs, is the one number that tells you nothing. The whole discipline is refusing to scale on the silicon and learning to scale on the serving layer instead.

Get the signal right, pay deliberately for a buffer that hides the cold start, and keep the node loop ahead of the pod loop. Do that and the autoscaler does what you always assumed it was doing: meets the 9am spike with capacity that is already warm, and gives the GPUs back when the day winds down.


Autoscaling LLM inference, the KEDA and Prometheus-Adapter wiring, cold-start mitigation, multi-LoRA and fleet routing, and the pod-vs-node control-loop reconciliation are part of the LLM Inference on Kubernetes course. The buffer-vs-cost tradeoff — warm nodes, spot capacity, and right-sizing the headroom — is the GPU Cost Optimization course. The GPU foundations beneath it all are the Production GPU Infrastructure course. Related reading: Your GPU Dashboard Says 100% Utilized. It's Lying. for why GPU utilization is the wrong scaling signal and what to watch instead, Tuning vLLM gpu_memory_utilization for the knob that sets the KV-cache size you scale against, Your LLM Cluster Is at 90% HBM and 60% Is KV Cache for what happens when no amount of pod scaling is enough, and Prompt Economics for the token-cost model that decides how much headroom you can afford to keep warm.
