Production LLM Inference on Kubernetes

Gateway vs Engine: The Two-Layer Architecture

The first time most teams deploy LLM inference in production, they put vLLM behind an ingress and call it a day. It works. They celebrate. A month later the same team is running six vLLM replicas, each handling auth, rate limiting, request logging, tenant routing, model selection, retries, streaming timeouts, cancellation propagation, prompt rewriting, and — somewhere in between all of that — actually running inference.

At that point every bug is everyone's bug. A rate-limit change requires a GPU redeploy. A streaming timeout regression requires a 40-minute model warmup. Adding a new customer requires a PR against the inference service.

The second deployment every team does — the one that survives — splits the stack into two layers: a stateless gateway in front and a GPU-bound engine behind it. This lesson is about why that split exists, what goes on which side, and the specific production problems it solves.

KEY CONCEPT

Gateway = everything that isn't running the model. Engine = only the model. The moment you entangle API concerns (auth, routing, quotas, streaming semantics) with GPU concerns (batching, KV cache, tensor ops), every deploy risks both. The split is not optional at scale — it's the single architecture decision that separates prototypes from production.


The monolithic default — and why it breaks

A typical first deployment:

  Client → Ingress → vLLM pod (GPU) → Response

One process handles everything. At low volume, it works. The problems show up in three phases.

Phase 1: the first weird request

A user sends a 50KB prompt. vLLM's tokenizer accepts it, the request enters the batch, and the prefill stage pins the GPU for 800ms. Every other concurrent request — including the 100-token chats from other users — gets queued behind it.

Diagnosis: you need rate limits and token-count quotas before a request reaches the GPU. But the only thing in front of the GPU is your ingress, which has no idea how many tokens a request contains.

The fix you try: add a sidecar. Now the sidecar has to understand the tokenizer, which is model-specific, which means it has to be redeployed every time the model changes.
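The gateway-side guard doesn't actually need the model's tokenizer — a rough characters-per-token estimate is enough to reject the pathological case before it reaches the GPU. A minimal sketch (the limit, ratio, and function names are illustrative, not from any real gateway):

```python
# Hypothetical gateway-side guard: reject oversized prompts before they
# reach the GPU, using a rough characters-per-token estimate instead of
# the model-specific tokenizer. Limits here are illustrative.
MAX_PROMPT_TOKENS = 4096
CHARS_PER_TOKEN = 4  # conservative heuristic for English text

def estimate_tokens(prompt: str) -> int:
    """Cheap upper-bound estimate; the engine still does the exact count."""
    return len(prompt) // CHARS_PER_TOKEN + 1

def admit(prompt: str) -> tuple[bool, str]:
    """Return (admitted, reason) without touching the engine."""
    est = estimate_tokens(prompt)
    if est > MAX_PROMPT_TOKENS:
        return False, f"prompt ~{est} tokens exceeds limit {MAX_PROMPT_TOKENS}"
    return True, "ok"
```

The estimate is deliberately loose: it only has to catch the 50KB prompt, and the engine remains the authority on exact token counts. Because the heuristic is model-agnostic, the gateway no longer needs a redeploy when the model changes.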

Phase 2: the first multi-tenancy ask

Product wants two customers sharing one GPU cluster with quota isolation. Each one pays differently, each has different rate limits, each needs separate logging and billing.

Diagnosis: the vLLM process now needs to know about tenants. It needs to authenticate them, count their tokens, enforce their quotas, log their usage. All of that code is now running inside the GPU process.

The fix you try: you add middleware inside the vLLM Python entrypoint. Now every deploy of that middleware is a GPU deploy. Every pip install for the tenant-billing library affects GPU warmup. You're modifying a serving binary that takes minutes to warm up just to change a rate-limit rule.

Phase 3: the first model swap

Your team wants to try a new model. You want to canary it to 5% of traffic, compare latency, roll back if the quality is worse.

Diagnosis: the routing logic needs to live somewhere. It can't live in the engine — each engine runs only one model at a time. The ingress doesn't know which model to pick.

The fix you try: a second ingress layer, or a Python service in front that proxies to the right vLLM. Congratulations — you just built a gateway. Now the question is whether you build it intentionally or by accident.
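The routing decision the accidental gateway has to make can be sketched in a few lines. This is a hypothetical weighted canary router (pool names, weights, and the route table are made up): it hashes the request ID so a given request always lands on the same branch, with roughly 5% of traffic going to the canary pool.

```python
# Minimal canary router sketch. Pool names and weights are hypothetical.
# Hashing the request ID makes routing deterministic per request, which
# keeps retries and debugging sane.
import hashlib

ROUTES = {"llama-3-70b": [("pool-a-stable", 95), ("pool-a-canary", 5)]}

def pick_pool(model: str, request_id: str) -> str:
    branches = ROUTES[model]
    total = sum(w for _, w in branches)
    # Stable hash of the request ID -> bucket in [0, total)
    h = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % total
    for pool, weight in branches:
        if h < weight:
            return pool
        h -= weight
    return branches[-1][0]
```

Rolling back the canary is a one-line change to the weight table — a gateway deploy measured in seconds, not an engine deploy measured in minutes.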


The two-layer architecture

CLIENT (app / SDK)
  │
  ▼
GATEWAY (stateless, CPU)
  auth / API keys / SSO
  rate limits / quotas / tenants
  routing / model selection / canary
  streaming / retries / timeouts
  logging / billing / usage
  │
  ▼
ENGINE (GPU)
  tokenizer
  continuous batching scheduler
  KV cache manager
  prefill + decode kernels
  streaming response generator

Deploy cadence:
  • Gateway: rolled continuously (every PR). Stateless, tiny pods, fast restart.
  • Engine: rolled on model / engine / driver changes. Slow warmup, large pods.

The critical property: the two layers deploy on different cadences, scale independently, and own disjoint failure modes.


What belongs in the gateway

Anything that fits at least one of these tests:

  1. It has nothing to do with the model or the GPU. Auth, quotas, logging, billing — these are business concerns, not inference concerns.
  2. It changes frequently. Rate limit rules change weekly. Model weights change monthly. Don't entangle them.
  3. It needs to survive the engine going away. When you roll the engine, the gateway returns a 503 with a useful error — not a TCP reset.
  4. It's cheap to run on CPU. You don't want to burn GPU memory on YAML parsing.

Concrete gateway responsibilities:

  • Auth / API keys / SSO — verifying who made the request.
  • Rate limits + quotas — enforcing tenant-level policy. Counting both request rate and token rate (distinct!).
  • Request routing — which engine pool handles this model? Which geography? Which canary branch?
  • Streaming semantics — SSE framing, backpressure, client disconnect propagation, timeouts. The engine streams bytes; the gateway decides what "timeout" means.
  • Observability — request logs, billing events, usage metrics. The business cares about requests, not GPU ops.
  • Protocol adaptation — OpenAI-compatible API in front of an engine that has its own native protocol. Model name aliasing. Parameter validation.
  • Retries — on engine 5xx, retry to another engine replica. The engine should never retry against itself.
  • Fan-out — embeddings over many chunks, multi-agent calls, speculative decoding orchestration.
PRO TIP

A useful litmus test: if I pointed curl at the gateway with a nonsense model name, should it return a 400 before touching the engine? Yes. So model validation belongs in the gateway, not the engine.
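That litmus test translates directly into a tiny lookup at the front of the request path. A sketch (the registry contents and alias table are hypothetical):

```python
# Sketch of the litmus test: resolve the model name in the gateway and
# return a 400-style error before any engine is contacted. The registry
# entries and aliases here are hypothetical.
MODEL_REGISTRY = {
    "llama-3-70b": "http://engine-pool-a",
    "llama-3-8b": "http://engine-pool-b",
}
ALIASES = {"gpt-4": "llama-3-70b"}  # OpenAI-compatible name mapping

def resolve_model(name: str):
    """Return (status, result): 200 with the pool URL, or 400 with an error."""
    canonical = ALIASES.get(name, name)
    pool = MODEL_REGISTRY.get(canonical)
    if pool is None:
        return 400, f"unknown model '{name}'"
    return 200, pool
```

Note that model aliasing (the protocol-adaptation responsibility above) falls out of the same lookup for free.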


What belongs in the engine

Anything that runs on the GPU, or anything that the GPU directly depends on:

  • Tokenizer — tied to the model, performance-critical, runs in the same process for simplicity.
  • Scheduler — continuous batching, prefill/decode interleaving, priority.
  • KV cache manager — block allocation, prefix sharing, eviction.
  • Prefill + decode kernels — the actual math.
  • Streaming generator — emits tokens as they're decoded.

Engine responsibilities should stop there. If you find yourself writing auth middleware for vLLM, you've leaked gateway into engine.

The rule of thumb

Can you put it in a sidecar running on a CPU node? Then it goes in the gateway.


Separation of deploys — the operational payoff

This is where the architectural split pays for itself. Gateway and engine have very different deploy cadences.

Gateway deploys

  • Small binary, CPU pod, ~100 MB image.
  • Starts in seconds, readiness check is trivial.
  • Rolling update: replace one of 30 pods at a time, health-checked.
  • Zero-downtime is trivial because pods are stateless and fungible.
  • Deploys happen tens of times a day — every PR that ships.

Engine deploys

  • Large binary, GPU pod, multi-GB image (model + CUDA libs).
  • Starts in minutes (model load, GPU warmup, CUDA graph capture).
  • Rolling update: replace one of 8 pods at a time, with 2-3 minutes of unavailability per replica.
  • Traffic shift is a separate step (canary %, error-budget-gated).
  • Deploys happen a few times a month — only when the model, engine, or driver changes.
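The engine's deploy constraints map directly onto a Kubernetes Deployment strategy. An illustrative fragment (names, replica counts, and probe timings are hypothetical, not a recommended config):

```yaml
# Illustrative engine Deployment strategy: roll one of 8 GPU pods at a
# time, and don't mark a pod Ready until the model is loaded and warmed
# up, so traffic never hits a cold replica.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: engine-llama-3-70b
spec:
  replicas: 8
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # at most one GPU pod down during the roll
      maxSurge: 0         # no spare GPUs to surge onto
  template:
    spec:
      containers:
        - name: vllm
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120  # model load + CUDA graph capture
            periodSeconds: 10
```

`maxSurge: 0` is the tell that this is an engine, not a gateway: there is no spare GPU capacity to surge onto, so the roll has to eat the per-replica unavailability instead.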
WARNING

Putting gateway logic inside the engine forces every rate-limit change to go through a multi-minute GPU deploy. Teams that have done this once remember forever — every oncall incident that starts with "just roll back the rate-limit config" turns into a 20-minute partial outage.


Scale independence

The two layers scale differently, too.

  • Gateway scales with QPS. Thousands of concurrent connections, mostly idle waiting on streaming responses. 4-8 CPU cores per replica, lots of small pods.
  • Engine scales with GPU-hours. Tens of concurrent batched requests per GPU, bound by KV cache and tensor throughput. Few large pods, tied to GPU availability.

If you entangle them, the autoscaling signals conflict. The gateway wants to scale out on connection count; the engine wants to scale out on GPU utilization. Trying to horizontally scale a single monolithic deployment on both signals is a losing game.

The separate layers let you:

  • Autoscale the gateway on QPS, CPU, or concurrent streams — cheaply, on CPU pods.
  • Autoscale the engine on a completely different signal — GPU utilization, queue depth, SLO burn — which matters for actual inference throughput.
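Concretely, that means two autoscalers watching two unrelated signals. An illustrative sketch (names, thresholds, and the `engine_queue_depth` metric are hypothetical; the engine half assumes a custom metrics adapter is installed):

```yaml
# Illustrative pair of autoscalers: the gateway scales on CPU, the
# engine on a queue-depth metric exposed by the serving stack.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gateway
spec:
  scaleTargetRef: {apiVersion: apps/v1, kind: Deployment, name: gateway}
  minReplicas: 20
  maxReplicas: 50
  metrics:
    - type: Resource
      resource: {name: cpu, target: {type: Utilization, averageUtilization: 60}}
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: engine-llama-3-70b
spec:
  scaleTargetRef: {apiVersion: apps/v1, kind: Deployment, name: engine-llama-3-70b}
  minReplicas: 4
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric: {name: engine_queue_depth}  # assumes a custom metrics adapter
        target: {type: AverageValue, averageValue: "4"}
```

Queue depth is usually a better engine signal than GPU utilization, since a continuously batched GPU tends to sit near 100% utilization whether it is keeping up or drowning.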

Fault isolation

Engines fail in ways gateways don't. A GPU can fall off the bus. A CUDA kernel can OOM. Model weights can be corrupted on disk. The driver can segfault.

When the engine fails:

  • Gateway stays up. It returns a clean 503 with a useful error. Clients can retry against a healthy replica.
  • Gateway observability works. You still see the error rate climb in dashboards. You still get the structured logs.
  • Gateway can fail over to a secondary engine pool in a different zone.

If it's all one process, an engine crash takes down the entire request path, including your observability path. You lose the ability to see the failure just when you need it most.
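The failover behavior described above is a short loop in the gateway. A sketch, assuming a `send` callable that raises `ConnectionError` when a replica is down (the URLs, error shape, and function names are all illustrative):

```python
# Sketch of gateway-side failover: try each engine replica in turn,
# log every attempt, and return a clean 503 with a useful body instead
# of a raw connection reset.
import logging

logger = logging.getLogger("gateway")

def forward(request, replicas, send):
    """`send(url, request)` raises ConnectionError when a replica is down."""
    for url in replicas:
        try:
            return 200, send(url, request)
        except ConnectionError as exc:
            # Observability survives the engine failure: this log line is
            # emitted from the gateway process, not the crashed engine.
            logger.warning("engine %s unavailable: %s", url, exc)
    return 503, {"error": "all engine replicas unavailable, retry shortly"}
```

The key property is in the comment: the log line and the 503 are produced by a process that is still alive when the engine isn't.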


The vLLM OpenAI server trap

vLLM ships with an api_server that speaks OpenAI's REST API. It's genuinely useful for prototypes. It also tempts teams into shipping it to production.

That API server is a minimal gateway glued to the engine in one process. It does:

  • Basic request handling.
  • OpenAI-compatible response formatting.
  • Some parameter validation.

It does not do:

  • Auth beyond a static token.
  • Rate limiting beyond a single in-process counter.
  • Multi-tenant quotas.
  • Model routing.
  • Canary deployments.
  • Proper streaming timeouts.
  • Metered logging.
  • Graceful shutdown.
KEY CONCEPT

The vLLM OpenAI server is a great starting point and a terrible endpoint. It blurs the gateway/engine boundary because it's in the same process as the engine. The first time you want to change an auth policy without rolling the engine, you will want to rip it out. Plan for that from the start.


What a minimal production gateway looks like

The smallest useful gateway — the thing you can ship in a week — does five things:

  1. Auth via shared secret / JWT / mTLS. A middleware that rejects unauthenticated requests.
  2. Request-count rate limits via a Redis-backed limiter, keyed by tenant.
  3. Token-count awareness — the gateway tokenizes (or estimates) the prompt, so it can reject a 50KB prompt before it hits the GPU.
  4. Model routing — a lookup from model string to engine pool URL.
  5. Streaming proxy with timeouts and proper client-cancellation propagation.

That's ~500 lines of Go or Python. It buys you every one of the operational benefits above.
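The five steps compose into one admission pipeline. A compressed sketch (every name, key, limit, and URL is hypothetical; a real gateway would back the limiter with Redis and hand step 5 off to a streaming proxy rather than return a route):

```python
# Sketch of the five-step admission pipeline. All names and limits are
# hypothetical; the in-process counter stands in for a Redis limiter.
API_KEYS = {"sk-tenant-a": "tenant-a"}
LIMITS = {"tenant-a": 100}           # requests per window
ROUTES = {"llama-3-8b": "http://engine-pool-b"}
MAX_PROMPT_CHARS = 16_000            # rough stand-in for a token quota
_counters: dict[str, int] = {}

def handle(api_key: str, model: str, prompt: str):
    tenant = API_KEYS.get(api_key)                      # 1. auth
    if tenant is None:
        return 401, "invalid API key"
    _counters[tenant] = _counters.get(tenant, 0) + 1    # 2. rate limit
    if _counters[tenant] > LIMITS[tenant]:
        return 429, "rate limit exceeded"
    if len(prompt) > MAX_PROMPT_CHARS:                  # 3. token-count guard
        return 413, "prompt too large"
    pool = ROUTES.get(model)                            # 4. model routing
    if pool is None:
        return 400, f"unknown model '{model}'"
    return 200, pool                                    # 5. hand off to streaming proxy
```

Notice that four of the five steps can reject a request without the engine ever knowing it existed — which is the whole point of the layer.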

As the product grows, the gateway grows — but the engine stays focused. That asymmetric growth pattern is the whole point.


A reference topology

Client
  │
  ▼
Load balancer (L7, cloud-managed)
  │
  ▼
Gateway replicas (CPU, 20-50 pods)
  ├── auth, quotas, routing, logging
  │
  ├─▶ Engine pool A (Llama-3-70B, 8× H100 pods)
  ├─▶ Engine pool B (Llama-3-8B, 4× A100 pods)
  └─▶ Engine pool C (embedding model, 2× L4 pods)
  • Engine pools are per-model.
  • Gateway replicas are any-model — they can route to any pool.
  • Adding a model means adding a pool + a routing rule. Zero changes to the other pools.

Common objections

"But this is more complexity!"

Yes, one more service. In exchange, you get independent deploys, independent scaling, independent failure domains, and the ability to onboard a new model without touching the existing ones. The complexity is front-loaded; the payoff is every future deploy being safer.

"Can't we just use an ingress for this?"

An ingress handles L7 routing and maybe rate limits. It does not tokenize. It does not understand model parameters. It does not do token-based quotas. It does not adapt protocols. Past a single model and a single tenant, you will end up adding a real application in front of the ingress. Call it a gateway and be done with it.

"Our vendor's model does all this for us."

If you're using a managed inference API (Bedrock, Vertex, OpenAI), the vendor runs the gateway and the engine. You don't need this architecture. This course is for teams self-hosting the engine — which is exactly when the split matters.


Quiz

KNOWLEDGE CHECK

You have vLLM running with its built-in OpenAI API server, handling production traffic. Your product manager asks to launch a new rate-limit policy by end of week. What is the most dangerous thing about this request in your current architecture?


What to take away

  • Production inference is two layers: a stateless gateway and a GPU-bound engine. The split is not optional at scale.
  • Gateway owns API concerns: auth, quotas, routing, streaming semantics, logging, billing, protocol adaptation.
  • Engine owns model concerns: tokenizer, scheduler, KV cache, kernels, streaming generation.
  • Different deploy cadences: the gateway rolls continuously (minutes), the engine rarely (model/driver changes). Don't couple them.
  • Different scale axes: gateway scales on connections/QPS (cheap CPU); engine on GPU utilization (expensive).
  • Independent failure domains: an engine crash shouldn't take your logging and error responses with it.
  • The vLLM built-in API server is a prototype aid, not a production endpoint. Plan to replace it before launching.

Next lesson: what actually happens between the HTTP request arriving and the first token streaming back — the full lifecycle through both layers.