Metrics That Actually Matter
Every team's first LLM inference dashboard has GPU utilization on it. Every mature one does not. GPU utilization is a seductive metric — green in Grafana, easy to alert on, looks like it means something. It is usually the least actionable number on the screen.
This lesson is about the metrics that actually predict production health: the two latency numbers that matter, the throughput metric that's a trap if you measure it wrong, the queue and KV-cache signals that tell you where pressure is, and the metrics that look meaningful but aren't.
Inference has two user-visible latency metrics: TTFT (time to first token) and TPOT (time per output token). Everything else — queue depth, KV cache utilization, batch size — is a leading indicator of those two. If your dashboard doesn't center TTFT and TPOT with SLO thresholds drawn on, you're flying blind.
The two latency metrics that matter
TTFT — Time To First Token
TTFT is the time from "client sent request" to "first token byte returned to client." For a chat product, this is perceived responsiveness. The user clicks Send; they see a cursor blinking; then tokens start appearing. That gap is TTFT.
TTFT = time_first_token_sent - time_request_received
TTFT is the sum of:
- Gateway processing (~2-5ms)
- Queue wait (0-500ms+ depending on load)
- Prefill time (~50ms to seconds, depending on prompt length)
- First decode step (~20ms)
When TTFT spikes, the question is: was it queue, or was it prefill? That's what Module 4 is about. But the SLO target you publish is TTFT, because that's what users feel.
Typical SLO targets:
- Chat UI: p95 under 500ms, p99 under 1s.
- Batch embeddings: no TTFT SLO (nobody's watching).
- Real-time voice / agents: p95 under 200ms.
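A minimal sketch of turning the TTFT formula and the chat-UI SLO targets above into a check over a window of request timestamps. The function and field names are illustrative, not from any particular gateway:

```python
def ttft_ms(time_request_received: float, time_first_token_sent: float) -> float:
    """TTFT per the formula above; inputs are epoch seconds."""
    return (time_first_token_sent - time_request_received) * 1000.0

def percentile(sorted_vals, q):
    """Nearest-rank percentile; assumes sorted_vals is sorted ascending."""
    idx = min(len(sorted_vals) - 1, int(q / 100.0 * len(sorted_vals)))
    return sorted_vals[idx]

def ttft_slo_ok(ttfts_ms, p95_budget_ms=500.0, p99_budget_ms=1000.0):
    """Chat-UI targets from above: p95 under 500ms, p99 under 1s."""
    vals = sorted(ttfts_ms)
    return (percentile(vals, 95) <= p95_budget_ms
            and percentile(vals, 99) <= p99_budget_ms)
```

In production you would compute these percentiles from a histogram metric rather than raw samples, but the budget comparison is the same.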
TPOT — Time Per Output Token
TPOT is the time between consecutive tokens in a streaming response. For a chat product, this is how fast the text scrolls. Tokens coming out at 50/s feels fast; tokens at 10/s feels like waiting.
TPOT = (last_token_time - first_token_time) / (output_tokens - 1)
TPOT is effectively the decode step time divided across requests in the batch, smoothed over the generation. It's a function of:
- Model size and architecture.
- Batch size (larger batch = higher TPOT, but higher aggregate throughput).
- KV cache pressure (swapping → spikes).
- GPU memory bandwidth (the underlying bottleneck).
Typical SLO targets:
- Chat: p95 TPOT under 40ms (25 tokens/s feels fast for English text).
- Production agent / tool calling: p95 under 30ms.
- Batch: no TPOT SLO; measure throughput instead.
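The TPOT formula above, as a small sketch (illustrative helper names) that also converts TPOT back to the streaming speed a user perceives:

```python
def tpot_ms(first_token_time: float, last_token_time: float,
            output_tokens: int) -> float:
    """Mean inter-token gap in ms, per TPOT = (last - first) / (n - 1)."""
    if output_tokens < 2:
        return 0.0  # a single-token response has no inter-token gap
    return (last_token_time - first_token_time) * 1000.0 / (output_tokens - 1)

def tokens_per_second(tpot_ms_val: float) -> float:
    """Convert TPOT (ms/token) to streaming speed (tokens/s)."""
    return 1000.0 / tpot_ms_val
```

Note the two ends of the SLO table connect: a 40ms TPOT is exactly the 25 tokens/s figure quoted above.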
Publish TTFT and TPOT as separate SLIs. They have different fix mechanisms (TTFT ↔ scheduling + prefill; TPOT ↔ batch size + memory bandwidth) and conflating them loses signal.
Throughput — the metric everyone gets wrong
"Throughput" means something different to almost every team that talks about it. The three common definitions:
- Requests per second (RPS) — how many completed requests per second.
- Output tokens per second (TPS) — how many output tokens the engine is producing per second.
- Total tokens per second — input tokens processed (prefill) plus output tokens produced.
Only one of these is a real throughput metric for an LLM engine.
RPS is misleading
RPS depends on output length. A fleet that produces 1,000 five-token responses is barely working; a fleet that produces 100 500-token responses is working 10× harder. RPS without output-length context is a count of how many requests you let in, not how much work you did.
RPS matters for gateway capacity planning (how many connections, how much auth CPU). It tells you almost nothing about engine saturation.
Output TPS is the one that matters
Output tokens per second is the real throughput of an engine. Every output token requires a full forward pass through the model. If your engine is producing 2,000 output tokens/second, that's literal GPU work.
output_tps = sum(output_tokens_produced_per_request) / time_window
Measure it with a rolling counter in your engine. vLLM exposes it as vllm:generation_tokens_total (counter).
Input TPS matters too — separately
Prefill is compute-bound, not memory-bandwidth-bound — and it dominates when prompts are long. An engine doing a lot of summarization (long prompts, short responses) might be compute-bound on prefill without producing many output tokens.
Tracking input TPS separately (tokens processed during prefill per second) lets you see this. vLLM exposes vllm:prompt_tokens_total.
The combined view
Plot input TPS and output TPS side by side. The ratio between them tells you whether a fleet is prefill-heavy (summarization-style traffic) or decode-heavy (chat-style traffic), which determines where optimization effort pays off.
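A sketch of classifying the mix from the two token rates. The thresholds are illustrative, not derived from vLLM:

```python
def workload_mix(input_tps: float, output_tps: float) -> str:
    """Rough classification of engine load from input vs output token rates.
    The 80%/20% boundaries are assumptions for illustration."""
    total = input_tps + output_tps
    if total == 0:
        return "idle"
    prefill_share = input_tps / total
    if prefill_share > 0.8:
        return "prefill-heavy (long prompts, short outputs)"
    if prefill_share < 0.2:
        return "decode-heavy (short prompts, long outputs)"
    return "mixed"
```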
Queue depth — the leading indicator
Queue depth is the single most predictive metric for production health. It tells you whether your engines have headroom or are about to tip into a latency spike.
What it is: the number of requests sitting in the scheduler's waiting queue, not yet admitted to the running batch.
Why it matters:
- Queue depth of zero → requests run immediately → TTFT ≈ prefill time.
- Queue depth of 10 → requests wait → TTFT = prefill + (some portion of decode cycle) × 10.
- Queue depth growing monotonically → you are under-provisioned, latency will spike.
Where to get it: your engine. vLLM exposes vllm:num_requests_waiting (gauge).
SLO usage:
- Alert if sustained queue depth > 5 for 2+ minutes.
- Scaling signal: scale up when queue depth > threshold.
Queue depth isn't something to average — it's something to watch the upper quantile on. A p99 queue depth of 12 tells you some requests are waiting behind 12 others, which is where your p99 TTFT comes from.
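The "sustained queue depth > 5 for 2+ minutes" alert above can be sketched as a simple state machine over timestamped samples (a toy version of what a Prometheus `for:` clause does):

```python
def sustained_breach(samples, threshold: float, duration_s: float) -> bool:
    """True if the metric stayed above `threshold` for at least `duration_s`.
    `samples` is a time-sorted list of (timestamp_s, value)."""
    breach_start = None
    for ts, val in samples:
        if val > threshold:
            if breach_start is None:
                breach_start = ts          # breach window opens
            if ts - breach_start >= duration_s:
                return True                # held long enough: fire
        else:
            breach_start = None            # any dip resets the window
    return False
```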
KV cache utilization — the second leading indicator
Queue depth rises when the engine can't admit requests. The most common reason: KV cache full.
What it is: the fraction of total KV cache blocks currently allocated.
kv_cache_usage = (allocated_blocks / total_blocks) * 100%
Why it matters:
- Under 70% → engine has headroom, requests admit fast.
- 70-90% → admissions slow down, scheduler starts being selective.
- Over 90% → eviction / swap territory, TPOT gets spiky.
- 100% → no new admissions until something finishes.
Where to get it: vLLM exposes vllm:gpu_cache_usage_perc (gauge).
SLO usage:
- Alert: sustained > 90% for 5 minutes (can't admit new work reliably).
- Capacity planning: target average usage of 60-70% at peak load.
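The usage formula and the pressure zones above, as a sketch (zone names are mine, boundaries are from this lesson):

```python
def kv_cache_usage_pct(allocated_blocks: int, total_blocks: int) -> float:
    """kv_cache_usage = (allocated_blocks / total_blocks) * 100%."""
    return 100.0 * allocated_blocks / total_blocks

def pressure_zone(usage_pct: float) -> str:
    """Map usage to the zones described above."""
    if usage_pct < 70:
        return "headroom"
    if usage_pct < 90:
        return "selective admission"
    if usage_pct < 100:
        return "eviction/swap risk"
    return "no new admissions"
```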
The KV-cache pressure → queue depth → TTFT chain
The causal chain runs left to right: KV cache fills → the scheduler stops admitting new requests → queue depth grows → queue wait inflates TTFT. Alert on the left end of the chain and you catch the breach before users feel it.
Batch size — the dashboard you'll look at during incidents
What it is: the number of requests currently in the running batch (being actively decoded).
Why it matters: batch size tells you how much aggregate work the engine is doing. An engine running at batch size 1 is wasting ~95% of its GPU throughput potential compared to batch size 64.
Where to get it: vLLM's vllm:num_requests_running (gauge).
SLO usage:
- No hard SLO — batch size is a composition metric.
- But dashboards should plot `batch_size` alongside `output_tps` — the relationship tells you whether the engine is still compute-bound or has hit the memory-bandwidth ceiling.
The batch-size curve
output_tps
▲
│ ╭───────── (saturated)
│ ╭──╯
│ ╭─╯
│ ╭─╯
│ ╭─╯
│╭╯
└──────────────────────▶ batch size
(more requests)
Throughput scales sublinearly with batch size, flattening when memory bandwidth saturates. The "knee" of the curve is roughly where you want to operate — adding more requests past it doesn't help, and starts to increase TPOT.
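One way to locate the knee from measured (batch size, output TPS) points — an illustrative heuristic, not a vLLM feature: keep growing the batch while the marginal throughput per added request slot is still a meaningful fraction of the average so far.

```python
def find_knee(points):
    """Given (batch_size, output_tps) samples sorted by batch size, return
    the largest batch size whose marginal gain per extra request slot is
    still > 20% of the average tokens/s-per-slot so far. The 20% cutoff
    is an assumption for illustration."""
    knee = points[0][0]
    for (b0, t0), (b1, t1) in zip(points, points[1:]):
        marginal = (t1 - t0) / (b1 - b0)   # extra tokens/s per extra slot
        if marginal > 0.2 * t0 / b0:       # still gaining meaningfully
            knee = b1
        else:
            break                          # flat region: past the knee
    return knee
```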
Metrics that look useful but aren't
GPU utilization — the classic trap
nvidia-smi reports GPU utilization. Your dashboard probably has it. It is almost always wrong about inference workloads.
Why: nvidia-smi utilization is "percent of time at least one SM was active." For decode-heavy workloads, the SMs are doing tiny amounts of work interspersed with memory fetches. The SMs register as "active," but they're idle waiting on memory.
A GPU at 98% utilization could be:
- Fully saturated doing prefill (real work).
- Decode-bound with no headroom (real work).
- Memory-bound with 30% of SMs spinning on cache misses (not real work).
Look at output_tps and batch_size instead. Those are unambiguous.
"Queue latency" averages
If you see "average time in queue: 25ms," your instinct might be "queue is fine." It isn't. Queue time is bimodal — most requests wait 0ms; some wait 500ms. Averages obscure the tail.
Track queue time as a histogram. Publish p50, p95, p99. The p99 is what your TTFT SLO feels.
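A sketch showing why the mean misleads on exactly the bimodal distribution described above (95% of requests wait 0ms, 5% wait 500ms — the mean reads a comfortable 25ms while the p99 is 500ms):

```python
def summarize_queue_times(samples_ms):
    """Mean plus p50/p95/p99 (nearest-rank) for a set of queue-time samples."""
    vals = sorted(samples_ms)

    def pct(q):
        return vals[min(len(vals) - 1, int(q / 100 * len(vals)))]

    return {"mean": sum(vals) / len(vals),
            "p50": pct(50), "p95": pct(95), "p99": pct(99)}
```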
GPU memory "free"
GPU memory has a specific layout: model weights are fixed, KV cache is the dynamic part. "Memory free" includes slack reserved by the engine for bursts. It doesn't tell you whether you can admit more work.
Track kv_cache_usage_perc instead. That's the actual KV headroom metric.
Request success rate
HTTP 200 / total is obvious, but inference has weird failures that return 200 with broken content (empty response, stop token hit at 1 token, repetitive babble). Real "success" for LLM inference is content-level and hard to auto-measure.
Useful: track HTTP 5xx rate. Useful: track "response had 0 tokens" rate. Less useful: track a vague "success rate."
The dashboard you should actually have
A good inference overview dashboard has exactly these panels, in this order:
1. TTFT and TPOT (p95/p99) with SLO thresholds drawn on.
2. Throughput: output TPS, with input TPS alongside.
3. Queue depth (upper quantiles, not the average).
4. KV cache utilization.
5. Batch size.
6. Error rate: HTTP 5xx plus zero-token responses.
If you have all six of these and nothing else, you will outperform 90% of production inference dashboards.
Labeling — cardinality caveats
Metrics should be labeled with at least:
- `model` — which model.
- `engine_pool` — which replica pool.
They should probably NOT be labeled with:
- `tenant` — cardinality can explode; track separately in the billing pipeline.
- `request_id` — absolutely not; every request becomes a new time series.
- `prompt_hash` — same problem.
Covered in the observability course, but worth repeating — label cardinality kills Prometheus. Stick to low-cardinality labels.
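A toy stand-in for a labeled Prometheus counter (not the real `prometheus_client` API) that makes the cardinality point concrete: each distinct label combination is a new time series.

```python
class LabeledCounter:
    """Illustrative labeled counter: one stored series per distinct
    label-value combination, which is exactly what Prometheus keeps."""

    def __init__(self, allowed_labels):
        self.allowed = tuple(allowed_labels)
        self.series = {}

    def inc(self, amount=1, **labels):
        if set(labels) != set(self.allowed):
            raise ValueError(f"expected labels {self.allowed}")
        key = tuple(labels[k] for k in self.allowed)
        self.series[key] = self.series.get(key, 0) + amount

    def cardinality(self):
        return len(self.series)
```

A thousand requests labeled with `model`/`engine_pool` stay in one series; a thousand requests labeled with `request_id` mint a thousand.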
Quiz
Your p99 TTFT just jumped from 200ms to 1.2s. Your dashboard shows: GPU utilization 98% (same as before), output TPS 800 (same as before), queue depth averaging 8 requests (was 0 before), KV cache 91% (was 45% before). What is the most likely root cause?
What to take away
- User-facing SLOs: TTFT (p95/p99) and TPOT (p95/p99). Everything else is a leading indicator of these.
- Throughput metric: output TPS. RPS alone is meaningless without tokens-per-request.
- Leading indicators: queue depth and KV cache usage. These predict TTFT breaches before users notice.
- Composition metrics: batch size and input TPS. Useful for understanding, not for alerting.
- Avoid: GPU utilization (lies about memory-bound workloads), queue time averages (hide the tail), GPU memory free (wrong layer of abstraction).
- Dashboard: six panels — SLO, throughput, queue depth, KV cache, batch size, error rate. One screen.
- Label with `model` and `engine_pool`; avoid `tenant`, `request_id`, `prompt_hash`.
Next module: vLLM itself — every flag that matters, the math behind the defaults, and the flags most engineers leave at defaults that are costing them 30% throughput.