The Inference Request Lifecycle
Most engineers who debug LLM inference have a model of how it works that is roughly correct at 30,000 feet and wildly inaccurate in the middle. They know there's "tokenization," "the model runs," and "tokens come back." When something goes wrong — tail latency spikes, throughput collapses, a request never returns — that mental model is useless for localizing the cause.
This lesson walks the actual path a single HTTP request takes from client socket to streaming response. If you understand this end-to-end, every debugging session in the rest of the course clicks into place.
A single inference request touches at least six distinct systems: gateway, engine scheduler, tokenizer, prefill pass, decode loop, and streaming response. Each has its own queue, its own failure modes, and its own timing contribution. With this map, "the model is slow" takes three minutes to refine into "prefill is slow on a batch of 4 with a 3000-token context."
The 30-second version
A request goes through these stages in order:

1. HTTP arrives at the load balancer and gateway.
2. Gateway validates and routes.
3. Engine accepts and tokenizes.
4. Scheduler queues and batches.
5. Prefill processes the prompt and produces the first token.
6. Decode loop generates the remaining tokens, one per step.
7. Tokens stream back to the client.
We'll walk each stage and pin down where time and failures actually go.
Stage 1: HTTP arrives
Client opens a TCP connection (usually HTTPS), sends headers and body. By the time the gateway sees it, the cloud load balancer has:
- Terminated TLS.
- Applied any geo-routing or simple WAF rules.
- Picked a gateway replica to forward to.
What can go wrong here: TLS handshake failure, connection limit on the LB, slow client body upload (you'd see read timeout in the gateway).
Timing: typically under 10ms on warm connections. Cold TLS handshakes can add 100-300ms.
Stage 2: Gateway validates and routes
The gateway receives the request. In roughly this order:
- Decode headers: authenticate (API key, JWT, mTLS).
- Parse body: JSON schema validation on the OpenAI-style request.
- Identify tenant: usually from the API key.
- Count tokens: the gateway either tokenizes or estimates. This is critical — quotas and routing often depend on token count.
- Check quotas: request rate, token rate, concurrent requests.
- Pick upstream: which engine pool handles this model? Which healthy replica?
- Open upstream connection: usually a persistent HTTP/2 connection pool to the engine.
- Forward request: send the body to the engine.
What can go wrong here:
- Auth fails → 401.
- Tenant over quota → 429.
- Unknown model → 400.
- No healthy upstream → 503.
- Upstream connection refused → 502, probably retry.
Timing: typically 1-5ms if everything is warm. Cold path (no pre-tokenized prompt, quota check against a remote Redis) can add 10-50ms.
A good gateway measures and logs each sub-stage's timing. "Request took 4.2ms in the gateway" is meaningless; "auth=0.3ms validation=0.6ms tokenize=1.8ms quota=0.4ms upstream_connect=0.1ms" is debuggable.
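The sub-stage logging described above can be sketched with a small timing helper. This is an illustrative skeleton, not a real gateway API: the stage names and the stubbed bodies are assumptions, but the structure — one timer per sub-stage, one log line per request — is the pattern that makes gateway latency debuggable.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(timings, name):
    # Record this sub-stage's wall-clock duration in milliseconds.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000.0

def handle(request):
    timings = {}
    with timed(timings, "auth"):
        pass  # verify API key / JWT (stubbed)
    with timed(timings, "validation"):
        pass  # JSON schema check on the request body (stubbed)
    with timed(timings, "tokenize"):
        pass  # count tokens for quota and routing decisions (stubbed)
    with timed(timings, "quota"):
        pass  # rate / token / concurrency checks (stubbed)
    # One log line per request, one field per sub-stage -- this is what
    # turns "4.2ms in the gateway" into something you can act on.
    return " ".join(f"{k}={v:.2f}ms" for k, v in timings.items())
```

The key design choice is that the timings dictionary lives per-request, so the log line is emitted atomically and can be joined against the engine's logs later.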
Stage 3: Engine accepts and tokenizes
The request lands on an engine replica. The engine's HTTP handler:
- Deserializes the request.
- Tokenizes: turns the prompt string into a list of token IDs. This is CPU work, done in a tokenizer that's bound to the model.
- Constructs a request object with the token list, sampling parameters (temperature, top-p, max tokens), and stop conditions.
- Hands off to the scheduler.
Timing: tokenization is usually ~1-3ms for a few hundred tokens, up to ~30ms for very long prompts. Slower than you'd think — tokenizers are not trivially fast.
What can go wrong:
- Prompt exceeds max_model_len → 400.
- Invalid sampling params → 400.
- Tokenizer is slow on unusual characters → latency spike.
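The request object and the validation that produces those 400s can be sketched as follows. The field names, the limit constant, and the error messages are illustrative assumptions — real engines differ — but the shape is typical: validate before the scheduler ever sees the request.

```python
from dataclasses import dataclass, field

MAX_MODEL_LEN = 4096  # assumption: the engine's context limit

@dataclass
class EngineRequest:
    token_ids: list              # output of the tokenizer
    temperature: float = 1.0
    top_p: float = 1.0
    max_tokens: int = 256
    stop: list = field(default_factory=list)

def validate(req: EngineRequest):
    # These checks turn bad inputs into a clean 400 at the HTTP handler
    # instead of a failure deep inside the scheduler or model.
    if len(req.token_ids) + req.max_tokens > MAX_MODEL_LEN:
        raise ValueError("400: prompt + max_tokens exceeds max_model_len")
    if not (0.0 < req.top_p <= 1.0):
        raise ValueError("400: invalid top_p")
    if req.temperature < 0.0:
        raise ValueError("400: invalid temperature")
```

Note that the length check includes max_tokens, not just the prompt: the KV cache must hold the full sequence, prompt plus generation.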
Stage 4: Scheduler queues and batches
This is where the real production complexity lives. The engine's scheduler is deciding which requests to run on the next GPU step.
The batch model
At any instant, the engine has a "running batch" of requests currently being decoded. It also has a waiting queue of new requests that haven't started yet.
On each scheduling step, the scheduler:
- Looks at the running batch — how much KV cache does it need right now?
- Looks at the waiting queue — can it admit one or more new requests given available KV cache blocks?
- Decides: prefill or decode? The engine alternates between running prefill on new requests (compute-heavy) and running decode on existing ones (memory-heavy).
- Updates the batch: adds admitted requests, evicts or swaps requests if pressured for memory.
If your request joins the running batch immediately, it goes to prefill next. If the batch is full or KV cache is exhausted, it waits in the queue.
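The admission decision can be sketched as a function over the waiting queue and the free KV cache blocks. This is a deliberately simplified model — real schedulers also handle preemption, swapping, priority reordering, and prefill/decode interleaving — but it captures the core loop: admit while blocks last, stop when the next request doesn't fit.

```python
def schedule_step(running, waiting, free_kv_blocks, blocks_needed):
    # One scheduling step: move waiting requests into the running batch
    # as long as there are enough free KV cache blocks for each.
    # `blocks_needed` is a caller-supplied estimator (illustrative).
    admitted = []
    for req in list(waiting):
        need = blocks_needed(req)
        if need <= free_kv_blocks:
            free_kv_blocks -= need
            waiting.remove(req)
            running.append(req)
            admitted.append(req)
        else:
            break  # head of queue doesn't fit; the rest keeps waiting
    return admitted, free_kv_blocks
```

The `break` on the first non-fitting request is the strict-FIFO behavior; as noted below, production schedulers often skip ahead to smaller requests instead, which is exactly the cleverness that makes their behavior non-obvious.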
Queueing — the first latency gotcha
Under load, queue time can dominate TTFT. A request that took 50ms of actual GPU time can easily spend 400ms waiting in the scheduler queue when concurrency is high.
The queue isn't FIFO — it's priority-weighted. Long prompts may be delayed in favor of short prompts, because admitting a long prompt requires a long prefill that stalls the batch. Schedulers are clever, and their cleverness is sometimes what you're debugging.
What can go wrong:
- Queue builds up faster than it drains → TTFT climbs → user sees latency spike.
- Long prompt blocks batch → short-prompt tail latency spikes.
- KV cache pressure triggers swap → throughput drops.
Queue depth is the single metric most predictive of tail latency. If your queue depth is sustained above zero, your p99 is not bounded by model speed — it's bounded by waiting time. Covered in the metrics lesson next.
Stage 5: Prefill
Prefill is the first forward pass of the model on the input prompt. It:
- Runs the prompt tokens through the model in a single pass — all tokens attend to all earlier tokens.
- Produces the KV cache — the key/value tensors for every layer, at every prompt position. This is what makes future decode steps fast.
- Generates the first output token.
Prefill is compute-bound — it's matrix math over the full prompt length. Roughly, time scales linearly with prompt length (ignoring quadratic attention costs at extreme context lengths).
Timing:
- Short prompt (100 tokens) on 70B model: ~50-100ms
- Long prompt (4000 tokens): ~400-800ms
- Very long prompt (32k tokens): multiple seconds — a meaningful fraction of your SLO
Prefill dominates TTFT — the time to first token, which is what the user perceives as "responsiveness."
Prefill is why long prompts hurt everyone in the batch, not just the long-prompt request. A 5-second prefill blocks every decode step for 5 seconds. If you're running chat alongside batch summarization, the chat users feel every batch prefill.
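A back-of-envelope model helps internalize the timings above. This sketch assumes linear scaling with a fixed per-step overhead; both constants are illustrative choices that roughly reproduce the 70B numbers listed earlier, not benchmarks.

```python
def estimate_prefill_ms(prompt_tokens, overhead_ms=40.0, ms_per_token=0.15):
    # Linear prefill model: fixed launch overhead plus per-token cost.
    # With these assumed constants, 100 tokens -> ~55ms, 4000 -> ~640ms,
    # 32k -> ~4.8s, roughly matching the ranges quoted above.
    return overhead_ms + prompt_tokens * ms_per_token
```

The useful part isn't the constants — it's the shape: doubling the prompt roughly doubles prefill time, so a prompt-length histogram of your traffic directly predicts your TTFT distribution.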
Stage 6: Decode loop
After prefill, the request enters the decode loop. Each iteration:
- Takes the latest output token and runs a single forward pass.
- Consults the KV cache for all previous positions.
- Produces the next output token.
- Emits the token downstream (streaming).
Unlike prefill, decode is memory-bound — almost all the work is fetching KV cache tensors from HBM and running a tiny amount of matmul. This is why decode is way less compute-efficient than prefill.
The critical property: continuous batching
Decode steps are where continuous batching pays off. The engine runs many requests together in one decode step — each producing one token per step — because the per-step cost is dominated by streaming model weights and KV cache from HBM, and the weight traffic happens once per step regardless of batch size. Adding more requests to the batch costs almost no extra time.

This is why throughput scales nearly linearly with concurrency: decode is cheap to stack.
Timing per decode step:
- Small batch (1 request), 7B model: 10-20ms per token.
- Large batch (64 requests), 7B model: 15-25ms per step → 64 tokens produced in ~20ms.
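The throughput arithmetic behind those numbers is worth making explicit. Each decode step produces one token per request in the batch, so batch throughput is batch size divided by step time:

```python
def decode_throughput(batch_size, step_ms):
    # Tokens per second across the whole batch: one token per request
    # per decode step.
    return batch_size * 1000.0 / step_ms

# Using the illustrative numbers above:
# batch of 1 at 15ms/step  -> ~67 tok/s
# batch of 64 at 20ms/step -> 3200 tok/s
```

A 64x larger batch at only ~1.3x the step time yields ~48x the throughput — this is the near-linear scaling that makes continuous batching the economic foundation of production inference.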
When a request leaves the batch
A request finishes the decode loop when any of these happens:
- Stop token — model emits the end-of-sequence token or whatever the stop condition is.
- Max tokens reached — hit the request's max_tokens limit.
- User cancels — client closed the connection, gateway propagates cancel.
- Timeout — engine-enforced cap on generation time.
On finish, the scheduler frees the KV cache blocks. That frees slots for queued requests to enter the batch.
Stage 7: Stream to client
As tokens come out of the decode loop, the engine streams them back:
- Engine emits a token to its HTTP response stream.
- Gateway forwards the token to the client stream (usually SSE or chunked HTTP).
- Client receives and displays / processes.
Streaming adds its own complexity:
- Backpressure: if the client is slow, the engine has to wait or buffer. Slow clients can effectively pin KV cache slots, hurting other users.
- Cancellation propagation: when the client disconnects, the gateway should cancel upstream so the engine can free KV cache.
- Timeouts: there are at least three — connect timeout, TTFT timeout, inter-token timeout. Each wants different values.
Backpressure in practice
Decoding produces tokens faster than most networks can deliver them, especially on mobile. If the engine doesn't pause, its write buffer fills up, and eventually it blocks the decode loop for that request.
A well-designed gateway + engine pair uses flow control: the engine's write to the gateway blocks when the gateway's write to the client blocks. The request naturally slows. Its KV cache stays pinned, but the GPU isn't wasting cycles generating tokens no one is reading.
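The blocking-write flow control described above can be modeled with a bounded buffer between the decode loop and the client writer. This is a sketch using a thread and a small `queue.Queue` — the real mechanism is TCP/HTTP2 flow control across processes, but the behavior is the same: when the buffer is full, the producer stalls.

```python
import queue
import threading

def decode_loop(out, tokens):
    # Engine side: emit tokens into a bounded buffer. put() blocks when
    # the buffer is full, naturally pausing generation for this request.
    for tok in tokens:
        out.put(tok)
    out.put(None)  # end-of-stream marker

def client_writer(out, received):
    # Gateway/client side: drains the buffer at whatever speed the
    # network allows.
    while (tok := out.get()) is not None:
        received.append(tok)

buf = queue.Queue(maxsize=4)  # small buffer = tight backpressure
received = []
producer = threading.Thread(target=decode_loop, args=(buf, list(range(20))))
producer.start()
client_writer(buf, received)
producer.join()
# All 20 tokens arrive, but the producer never runs more than 4 ahead.
```

The buffer size is the key tuning knob: too small and you add latency jitter for fast clients; too large and slow clients pin KV cache while you generate tokens they haven't read.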
A realistic timing budget
For a 70B model on 8× H100, handling a chat request with a 500-token prompt and 200-token response:
| Stage | Typical time |
|---|---|
| HTTP arrival | under 5ms |
| Gateway processing | 2-5ms |
| Engine HTTP + tokenize | 2-5ms |
| Scheduler queue | 0-200ms (depends on load) |
| Prefill (500 tokens) | 80-150ms |
| Decode (200 tokens at 20ms each) | ~4s |
| Stream overhead | 0-10ms per token |
TTFT ≈ HTTP + gateway + tokenize + queue + prefill ≈ 90ms to 400ms
Total request time ≈ TTFT + decode ≈ 4-4.5s
Now consider what moves when things go wrong:
- Queue swells under load → TTFT doubles. Model throughput unchanged; you're just waiting.
- KV cache pressure → requests get swapped out → decode rate drops, TPOT goes up.
- Long prompt in batch → other requests' per-step time goes up → every concurrent user feels it.
- Slow client → request stays in batch longer → fewer slots available for others.
All of these are diagnosable only if you know which stage is which. Which is what Module 4 is about.
The gateway's view vs the engine's view
A fundamental asymmetry worth internalizing:
The gateway sees: wall-clock request time, HTTP status, streaming start, client disconnect.
The engine sees: queue time, prefill time, decode steps, KV cache events.
Neither side has the full picture alone. The gateway can tell you "request took 4.2s"; only the engine can tell you "1.8s of that was queue time because KV cache was full." Combining them is the whole point of Module 4.
In production, every request should have a request_id generated at the gateway and propagated to the engine. Logs from both sides get correlated via this ID. Without it, debugging cross-layer issues becomes guesswork.
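The propagation itself is a few lines at the gateway. The header name `x-request-id` is a common convention, not a standard — adjust to whatever your stack uses:

```python
import uuid

def gateway_headers(incoming_headers):
    # Honor a client-supplied request ID if present, otherwise mint one.
    # The same ID is forwarded upstream so engine logs and gateway logs
    # can be joined after the fact.
    rid = incoming_headers.get("x-request-id") or str(uuid.uuid4())
    return {**incoming_headers, "x-request-id": rid}
```

The "honor if present" branch matters: it lets a client (or an upstream service) trace a request across multiple hops with one ID.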
Cancellation: the most-missed lifecycle bit
Most teams forget that cancellation is part of the lifecycle. A client disconnects — what happens?
Without proper cancellation propagation:
- Gateway notices disconnect, closes its upstream connection to the engine.
- Engine's HTTP handler notices the connection died, eventually.
- The request stays in the running batch, pinning KV cache until it times out.
- You waste GPU time generating tokens no one is reading.
With proper cancellation propagation:
- Gateway propagates cancel signal via HTTP/2 stream reset (or explicit DELETE).
- Engine handler receives cancellation, asks scheduler to abort the request.
- Scheduler removes the request from the batch, frees KV cache.
- Next scheduling step can admit a new request into the freed slot.
This is a throughput feature, not just a cleanup nicety. Long-running requests with flaky clients can chew through your KV cache capacity if cancellation doesn't work. We'll cover this in the streaming APIs lesson.
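In an asyncio-based engine handler, proper propagation looks roughly like this. The KV bookkeeping is a stand-in dictionary and the loop is a stubbed decode step — assumptions for illustration — but the pattern is real: catch the cancellation, free resources immediately, re-raise.

```python
import asyncio

async def generate(kv_blocks, request_id):
    kv_blocks[request_id] = 8  # pretend this request pins 8 KV blocks
    try:
        while True:
            await asyncio.sleep(0.01)  # stand-in for one decode step
    except asyncio.CancelledError:
        # Cancellation reached the engine: free the KV cache now,
        # instead of leaking it until some timeout fires.
        del kv_blocks[request_id]
        raise

async def main():
    kv_blocks = {}
    task = asyncio.create_task(generate(kv_blocks, "req-1"))
    await asyncio.sleep(0.05)   # client disconnects mid-generation...
    task.cancel()               # ...and the gateway propagates the cancel
    try:
        await task
    except asyncio.CancelledError:
        pass
    return kv_blocks

blocks = asyncio.run(main())  # kv_blocks is empty: freed on cancel
```

The re-raise in the except block is important: swallowing CancelledError makes the task look like it completed normally and can confuse supervisors higher up the stack.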
Quiz
Your p99 time-to-first-token (TTFT) is 1.2s, way above your SLO. Your p99 time-per-output-token (TPOT) is perfectly healthy. You profile a typical failing request and find the engine only took 80ms of GPU time. Where is the missing time most likely going?
What to take away
- A request passes through seven distinct stages: HTTP, gateway validate+route, engine accept+tokenize, scheduler queue, prefill, decode loop, stream response. Each has its own timing and failure modes.
- TTFT = everything before the first decode step. Prefill dominates at high prompt lengths; queue time dominates at high load.
- TPOT = per-step time in the decode loop. Continuous batching makes per-step time stay flat as batch grows — up to KV cache limits.
- Prefill is compute-bound, decode is memory-bound. Long prompts hurt prefill and thus hurt TTFT for everyone in the batch.
- Cancellation is part of the lifecycle. Without propagation, client disconnects leak KV cache and hurt throughput.
- Every request should carry a correlation ID so gateway logs and engine logs reconstruct the full story.
Next lesson: which metrics actually matter on each side of the split — and which ones are lying to you.