Static batching pads every request to the length of the longest one in the batch. Short requests finish early, and their GPU slots sit idle, burning money, until the whole batch drains. Continuous batching schedules at the granularity of a single token instead of a whole request, and it is the single biggest reason vLLM can be as much as 5x faster than naive serving. Here is exactly how it works.
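To make the idle-slot argument concrete, here is a minimal toy sketch that counts decode iterations under each policy. It is not vLLM's scheduler: the function names and the workload are hypothetical, prefill and memory limits are ignored, and every sequence is assumed to emit exactly one token per iteration.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Decode iterations when each batch runs until its longest request ends."""
    total = 0
    for i in range(0, len(output_lens), batch_size):
        # The whole batch occupies the GPU for as long as its longest member.
        total += max(output_lens[i:i + batch_size])
    return total


def continuous_batching_steps(output_lens, batch_size):
    """Decode iterations when a freed slot is refilled before the next step."""
    queue = deque(output_lens)
    active = []  # tokens still to generate for each occupied slot
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        # One iteration: every active sequence emits one token,
        # and finished sequences release their slots.
        active = [remaining - 1 for remaining in active if remaining > 1]
        steps += 1
    return steps


# Hypothetical workload: output lengths in tokens, two slots on the GPU.
lens = [2, 8, 2, 8]
print(static_batching_steps(lens, batch_size=2))
print(continuous_batching_steps(lens, batch_size=2))
```

On this toy workload the static policy needs 16 iterations and the continuous one needs 12, because the short requests hand their slots to the next request in the queue instead of waiting for the batch to drain.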
The default Kubernetes autoscaler watches CPU. Your GPU sits at 100% no matter what. So your inference fleet either never scales or scales 90 seconds too late: after the cold start, after the SLO has already broken. This covers the signals that actually predict load, the KEDA wiring, and the cold-start tax that makes reactive scaling a trap.
The unit of cost in an LLM system is the token, and almost nobody is counting them. Output tokens cost 3-5x as much as input tokens. Your context window is 80% dead weight. This is the cost-per-request math, where the tokens actually go, and the levers that bend the curve, in ROI order.
Both engines hit similar throughput on similar hardware in 2026. The decision comes down to workload shape (agents vs chat vs RAG), structured output needs, and operational maturity. Here is the honest production comparison.
vLLM prefix caching is great. It stops at one node. When your fleet of 50 H100s is bottlenecked on KV cache and adding GPUs is not financially viable, the next architecture is disaggregated KV cache. Here is the wall, the math, Mooncake, and what to actually do on Monday.