Your GPU Finishes a Request and Waits for the Slowest. Continuous Batching Is the Fix.

Static batching pads every request to the length of the longest one in the batch. Short requests finish and their GPU slots sit idle, burning money, until the whole batch drains. Continuous batching schedules at the granularity of a single token instead of a whole request — and it is the single biggest reason vLLM is 5x faster than naive serving. Here is exactly how it works.

By Sharon Sahadevan·May 31, 2026·14 min read

You batch eight requests onto one GPU because batching is how you get throughput. Seven of them are short — a quick classification, 30 tokens of output each. One is a long generation: 1,500 tokens. You run them together in one batch.

The seven short requests finish in a fraction of a second. Then their slots sit there, allocated, occupied, doing nothing, for the entire time it takes the eighth request to grind out its remaining 1,470 tokens. The GPU is "processing a batch of 8" the whole time, but for most of that window it is really processing a batch of 1 while seven slots' worth of compute and memory go to waste. Your dashboard says the GPU is busy. Your throughput says otherwise.

This is static batching, and the waste is not a tuning problem — it is structural. Every batch runs at the speed of its slowest member, and in LLM serving the slowest member is wildly slower than the median. The fix is a different scheduling model entirely: continuous batching, the technique that is the single biggest reason vLLM, SGLang, and TGI are several times faster than a naive serving loop. Every post in this series mentions it in passing. This one is the mechanism itself.

Why static batching wastes so much GPU

To see the waste, you have to see how an LLM request actually runs. It has two phases:

Prefill: the model processes the entire input prompt in one big parallel pass and produces the first output token. Compute-heavy.
Decode: the model generates output one token at a time, each step consuming all previous tokens to produce the next. This is the long part — one forward pass per output token.

A request's runtime is dominated by decode, and decode length is unpredictable. One user asks for a yes/no answer (5 tokens). Another asks for a full code review (1,500 tokens). You cannot know in advance which is which.

Static batching (and its slightly smarter cousin, dynamic batching, which just waits a few milliseconds to collect a fuller batch) makes one fatal assumption: the batch is fixed for its entire lifetime. You assemble N requests, run them together, and return results when all N are done. The batch runs for as many decode steps as the longest sequence needs. Every sequence shorter than that finishes early and then occupies a dead slot — KV cache memory reserved, batch dimension padded — contributing nothing but holding resources until the batch drains.

KEY CONCEPT

The core defect of static batching: a request that finishes early cannot leave the batch, and a request that is waiting cannot join it. The batch is frozen from the moment it starts until the slowest member completes. With LLM decode lengths varying by 50x or more between requests, that means most slots in a batch are idle most of the time. You are paying for a full GPU and using a fraction of it.

The math is brutal. If your batch holds one 1,500-token generation and seven 30-token generations, the batch lives for 1,500 decode steps. The seven short requests are done after 30 steps and sit idle for the remaining 1,470. That is 98% of those seven slots' capacity, and roughly 86% of the whole batch's slot-steps, gone. And because the idle slots also hold reserved KV cache, you cannot admit new work into that memory: it is not freed until the batch ends.
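The slot accounting can be checked in a few lines. This is only a sketch of the arithmetic: it counts one slot-step per sequence per decode iteration, and the variable names are illustrative rather than taken from any engine.

```python
# Slot-step accounting for the example batch: one 1,500-token generation
# plus seven 30-token generations, all held until the longest one finishes.
output_lengths = [1500] + [30] * 7

batch_steps = max(output_lengths)                 # a static batch lives this long
total_slot_steps = batch_steps * len(output_lengths)
useful_slot_steps = sum(output_lengths)           # steps that actually emitted a token
idle_slot_steps = total_slot_steps - useful_slot_steps

# Waste measured against the seven short slots only.
short_slot_steps = batch_steps * 7
short_idle_fraction = idle_slot_steps / short_slot_steps

# Waste measured against all eight slots in the batch.
batch_idle_fraction = idle_slot_steps / total_slot_steps

print(f"idle slot-steps: {idle_slot_steps}")
print(f"idle share of the seven short slots: {short_idle_fraction:.0%}")
print(f"idle share of the whole batch: {batch_idle_fraction:.0%}")
```

The same accounting applied to your own output-length distribution gives a quick estimate of how much a static batcher would waste on it.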

What continuous batching actually does

Continuous batching — also called iteration-level scheduling, introduced by the Orca paper (OSDI 2022) and now the default in every serious inference engine — changes the unit of scheduling from a request to a single decode iteration.

The scheduler runs a loop. On every iteration (every single forward pass that produces one token for each active sequence), it does three things:

Run one step for every sequence currently in the batch.
Evict any sequence that just emitted its stop token or hit its length limit — immediately, that same iteration. Its slot and its KV cache are freed right now.
Admit waiting requests into the freed slots — also immediately — running their prefill and folding them into the running batch for the next iteration.

The batch composition changes every token. There is no "batch lifetime." A sequence joins when there is room and leaves the instant it is done, independent of every other sequence. The 1,500-token request and the 30-token request coexist for 30 steps; then the short one leaves, a new request takes its slot, and the long one keeps going alongside fresh neighbors. No slot waits for the slowest. No freed memory sits locked behind a still-running sequence.
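The loop is easier to reason about as a toy simulation. The sketch below assumes a request is nothing more than a count of tokens left to generate, ignores prefill cost and KV memory limits, and uses hypothetical function names. It rotates the three steps to admit, run, evict so that the first iteration has something to run.

```python
from collections import deque

def continuous_batching_steps(output_lengths, max_slots):
    """Count decode iterations when free slots are refilled every iteration."""
    waiting = deque(output_lengths)
    running = []          # tokens remaining for each active sequence
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < max_slots:
            running.append(waiting.popleft())
        # Run one decode step for every active sequence.
        running = [remaining - 1 for remaining in running]
        steps += 1
        # Evict sequences that just finished; their slots are free next iteration.
        running = [remaining for remaining in running if remaining > 0]
    return steps

def static_batching_steps(output_lengths, max_slots):
    """Count decode iterations when each batch runs until its longest member ends."""
    steps = 0
    for start in range(0, len(output_lengths), max_slots):
        steps += max(output_lengths[start:start + max_slots])
    return steps

# Two long generations mixed into a stream of short ones (lengths are made up).
lengths = [1500] + [30] * 7 + [1500] + [30] * 7
print("static:", static_batching_steps(lengths, max_slots=8))
print("continuous:", continuous_batching_steps(lengths, max_slots=8))
```

With these made-up lengths the static schedule needs 3,000 decode iterations and the continuous one 1,530, because the second long request starts as soon as a slot frees up instead of waiting for the first batch to drain.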

PRO TIP

The mental model: static batching is a bus that will not let anyone off until every passenger has reached their stop — so the person going one block rides the entire route. Continuous batching is a normal bus: people get on and off at every stop, and the seat you vacate is taken by the next person waiting at the curb. Same vehicle, same number of seats, vastly higher passenger throughput — because no seat rides empty waiting for the long-haul passenger to finish.

The throughput difference is not incremental. On realistic workloads with mixed sequence lengths, continuous batching delivers on the order of 5x the throughput of static batching at comparable latency, with the exact multiple depending on how much output lengths vary. Multi-fold throughput gains were the headline result of the original vLLM work. This is the reason a modern engine beats a hand-rolled model.generate() loop, and it is why the throughput numbers in the vLLM vs SGLang comparison are even in the conversation.

Why it needs PagedAttention

Continuous batching has a hard practical dependency, and it is the reason it was difficult to run efficiently before vLLM: you cannot get its full benefit with naively allocated KV cache.

In a static batch, you know the batch's shape up front, so you can allocate one big contiguous block of KV cache memory for the whole batch and pad to the max length. But continuous batching has sequences of constantly-changing length joining and leaving every iteration. A sequence that just joined needs a few KV blocks; one that has been decoding for 1,000 steps needs many. They enter and exit at different times. Contiguous pre-allocation is impossible — you would either massively over-reserve (defeating the point) or constantly reallocate and copy (too slow).

PagedAttention solves this by managing KV cache the way an OS manages RAM: in fixed-size blocks that need not be contiguous, allocated on demand as a sequence grows and freed instantly when it leaves. This is what makes "admit and evict every iteration" cheap enough to do per-token. The two techniques are complementary: continuous batching is the scheduling innovation, and PagedAttention is the memory innovation that makes it efficient at scale. The KV cache wall post and the fragmentation post are both about what happens to that block pool under pressure. In practice, you want both together.
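A toy allocator makes the block-pool idea concrete. This is a minimal sketch, not vLLM's implementation: the class name, its methods, and the 16-token block size are all assumptions chosen for illustration, and no actual tensors are stored.

```python
class BlockPool:
    """Toy paged KV allocator: fixed-size blocks handed out on demand."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size              # tokens per block (assumed value)
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}                    # seq_id -> list of block ids
        self.token_counts = {}                    # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve room for one more token; return False if the pool is exhausted."""
        count = self.token_counts.get(seq_id, 0)
        if count % self.block_size == 0:          # last block is full, or none yet
            if not self.free_blocks:
                return False                      # a real scheduler would preempt here
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.token_counts[seq_id] = count + 1
        return True

    def free(self, seq_id):
        """Return every block of a finished sequence to the pool at once."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.token_counts.pop(seq_id, None)

pool = BlockPool(num_blocks=8)
for _ in range(40):
    pool.append_token("long")     # 40 tokens occupy 3 blocks of 16
for _ in range(5):
    pool.append_token("short")    # 5 tokens occupy 1 block
pool.free("short")                # that block is reusable immediately
print(len(pool.free_blocks), pool.block_tables["long"])
```

Because a sequence's blocks are tracked in a per-sequence table rather than one contiguous region, a sequence that leaves never strands memory behind one that is still running.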

The prefill problem and chunked prefill

Continuous batching introduces a new headache: prefill and decode are very different shapes of work, and mixing them naively causes stalls.

Decode steps are small and fast — one token per sequence. Prefill is large — it processes a whole prompt at once. When a new request with a 4,000-token prompt joins the batch, its prefill pass is far heavier than the decode steps of its neighbors. If the scheduler runs that full prefill in one iteration, every already-running sequence stalls while it completes. One user's long prompt spikes the latency of everyone else's in-flight generation. This is head-of-line blocking, and it shows up as ugly TTFT and inter-token-latency variance under mixed load.

Chunked prefill is the fix: instead of processing a long prompt in one giant pass, the scheduler splits it into smaller token-budget chunks and interleaves those chunks with the ongoing decode steps across several iterations. A big new prompt no longer freezes the batch; it is fed in one budget-sized chunk at a time while everyone else keeps decoding. It smooths latency at a small throughput cost and changes the memory profile (a long prompt no longer needs its entire KV allocation up front), which is exactly why the vLLM tuning post flags that you must re-tune gpu_memory_utilization after enabling it.
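The interleaving can be sketched as a budget split. The toy model below assumes every decoding sequence takes one token of the per-iteration budget and whatever is left goes to the pending prefill; the function name and the numbers are hypothetical, and real schedulers handle several pending prefills and memory limits as well.

```python
def plan_iterations(prompt_tokens, decoding_seqs, token_budget):
    """Split one long prefill into per-iteration chunks under a token budget.

    Returns a list of (decode_tokens, prefill_tokens) tuples, one per iteration.
    """
    if token_budget <= decoding_seqs:
        raise ValueError("budget leaves no room for prefill")
    plan = []
    remaining = prompt_tokens
    while remaining > 0:
        # Decodes are served first; the prefill chunk fills the rest of the budget.
        chunk = min(remaining, token_budget - decoding_seqs)
        plan.append((decoding_seqs, chunk))
        remaining -= chunk
    return plan

# A 4,000-token prompt joining 48 decoding sequences under a 1,024-token budget.
for decode_tokens, prefill_tokens in plan_iterations(4000, 48, 1024):
    print(decode_tokens, prefill_tokens)
```

In this sketch the prompt is absorbed over five iterations, and the 48 running sequences emit a token in every one of them instead of stalling behind a single 4,000-token pass.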

The knobs that control it

Continuous batching is automatic in vLLM, but three parameters shape its behavior, and they are the ones to reach for when throughput or latency is off:

max_num_seqs            # max sequences in the running batch at once (vLLM default 256)
max_num_batched_tokens  # token budget processed per iteration (prefill + decode)
enable_chunked_prefill  # split long prefills across iterations to avoid HOL blocking

max_num_seqs caps how many sequences run concurrently. Higher means more throughput — until the KV cache cannot hold that many sequences' worth of blocks and the scheduler starts preempting. Raising this without the HBM to back it just trades throughput for preemption thrash.
max_num_batched_tokens is the per-iteration token budget shared between prefill and decode. Set it too high relative to KV cache and you will see preemption even at modest concurrency (the tuning post calls this out directly). Too low and you leave throughput on the table.
enable_chunked_prefill trades a little peak throughput for much smoother latency under mixed prompt lengths. On chat and agent workloads with variable prompt sizes, usually worth it.
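As a sketch of how the three knobs are passed in practice, the helper below builds a `vllm serve` command line. The helper itself is hypothetical, the model name is a placeholder, and the values are arbitrary starting points rather than recommendations; check the flags against the vLLM version you run.

```python
def vllm_serve_args(model, max_num_seqs, max_num_batched_tokens, chunked_prefill=True):
    """Build an argv list for `vllm serve` with the three batching knobs set."""
    args = [
        "vllm", "serve", model,
        "--max-num-seqs", str(max_num_seqs),
        "--max-num-batched-tokens", str(max_num_batched_tokens),
    ]
    if chunked_prefill:
        args.append("--enable-chunked-prefill")
    return args

# "my-org/my-model" is a placeholder; 128 and 4096 are illustrative values only.
print(" ".join(vllm_serve_args("my-org/my-model", 128, 4096)))
```

Keeping the knobs in one place like this makes it easier to change one at a time and compare preemption counts and latency percentiles between runs.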

These are also why the autoscaling signals work the way they do: vllm:num_requests_running fluctuates because continuous batching admits and evicts every iteration, and it is exactly the live batch-occupancy signal the autoscaling post scales on.

The trade-offs nobody mentions

Continuous batching is close to a free lunch, but not entirely:

Scheduling overhead. Re-planning the batch every iteration is not free. For tiny models or trivially short sequences the per-iteration bookkeeping can eat into the win — though in practice the throughput gain dominates for anything realistic.
Latency variance. A request's neighbors change every token, so its inter-token latency depends on what else is in the batch at each step. A request that started alone and fast can slow down as the batch fills around it. Average throughput goes way up; individual-request timing gets noisier. This is why you watch TTFT and inter-token-latency distributions, not means.
Preemption under memory pressure. When the KV cache pool fills (because continuous batching aggressively admits work), the scheduler must preempt running sequences — recompute or swap their KV state — which causes a latency cliff. This is the failure mode behind the KV cache wall and the reason gpu_cache_usage_perc is a leading signal to scale on before the cliff hits.
Fairness. A flood of new short requests and a few long-running generations compete for slots every iteration. Default schedulers are roughly first-come; if you need fairness or priority across tenants, that is policy you have to add on top, not something the batcher gives you for free.

WAR STORY

A team I worked with had built their own inference service on a plain Hugging Face generate() loop with dynamic batching — collect requests for 50ms, run the batch, return when all finished. It worked in their load tests, which used uniform 256-token completions. In production, real traffic had a long tail: most requests were short, but a few asked for 2,000-token outputs. Those long requests pinned entire batches open, and the short requests behind them piled up. GPU utilization read near 100% the whole time, so they kept buying GPUs, but throughput per GPU was a fraction of what the hardware could do — most of every batch was idle slots waiting on one long generation. Migrating to vLLM and letting its continuous batching scheduler admit and evict per-token raised effective throughput roughly 4x on the same hardware. They went from planning a GPU expansion to giving GPUs back. The uniform-length load test had hidden the entire problem; the long tail was where all the waste lived, and only iteration-level scheduling could reclaim it.

Common mistakes

Benchmarking with uniform sequence lengths. Continuous batching's win comes entirely from variance in decode length. A load test where every request generates the same number of tokens makes static and continuous batching look nearly identical — and hides the 4-5x gap that real, variable traffic exposes. Always benchmark with a realistic length distribution.

Rolling your own batching loop. "We just batch requests ourselves" almost always means static or dynamic batching, which means you are leaving most of your GPU on the table. Iteration-level scheduling plus paged KV cache is genuinely hard to build correctly; use an engine that already has it.

Cranking max_num_seqs without the HBM to back it. More concurrent sequences only helps if the KV cache can hold them. Past that point you trade throughput for preemption thrash and latency cliffs.

Ignoring chunked prefill on mixed-length traffic. Without it, one long prompt joining the batch stalls everyone else's decode. If your TTFT and inter-token latency spike whenever a big prompt arrives, this is usually why.

Reading mean latency instead of the distribution. Continuous batching makes per-request timing depend on batch composition, so the mean hides real tail behavior. Watch p95/p99 TTFT and inter-token latency.
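A small example shows how far the mean can sit from the tail. The samples below are synthetic numbers made up for illustration, not measurements, and the helper name is hypothetical; only the standard library is used.

```python
import statistics

def latency_summary(samples_ms):
    """Return mean, p50, p95 and p99 for a list of latency samples in milliseconds."""
    cuts = statistics.quantiles(samples_ms, n=100)   # 99 percentile cut points
    return {
        "mean": statistics.fmean(samples_ms),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }

# Synthetic inter-token latencies: mostly fast, with a slow tail (made-up values).
samples = [20.0] * 90 + [35.0] * 7 + [250.0] * 3
summary = latency_summary(samples)
for name, value in summary.items():
    print(f"{name}: {value:.1f} ms")
```

In this synthetic sample the mean stays under 30 ms while p99 is 250 ms, which is the gap a mean-only dashboard would hide.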

The mental model

Static batching treats a batch as a fixed convoy: everyone departs together and nobody arrives until the slowest vehicle does. It was inherited from how we batch in training, where every example in a batch genuinely does the same amount of work. Inference broke that assumption — decode lengths vary wildly and unpredictably — and static batching never adapted, so it wastes the difference.

Continuous batching throws out the convoy and schedules one token at a time across a fluid set of sequences that join and leave continuously. Each sequence runs exactly as long as it needs and no longer; each freed slot is immediately refilled; no compute or memory waits on the slowest member. It needs paged KV memory to make the per-iteration churn cheap, and chunked prefill to keep big new prompts from stalling the batch — but with those, it converts the long-tail variance that crippled static batching from a liability into something the scheduler simply absorbs.

That is the whole reason a modern inference engine is several times faster than a loop you would write yourself, on the exact same GPU. Not a better kernel, not a bigger model — a better schedule. When you read "vLLM gets 5x throughput," continuous batching is the noun that sentence is about. Everything else in the LLM serving stack — KV cache architecture, autoscaling signals, GPU observability — sits on top of this scheduling model and assumes it is there.

Continuous batching, PagedAttention, chunked prefill, and the full inference-engine scheduling model — plus how to tune and deploy them on Kubernetes — are covered in the LLM Inference on Kubernetes course and the LLM Operations course. The GPU foundations beneath the scheduler are the Production GPU Infrastructure course. Related reading: vLLM vs SGLang for Production in 2026 for the engines that implement this and where they differ, Your LLM Cluster Is at 90% HBM and 60% Is KV Cache for the paged memory that makes continuous batching possible and what happens when it saturates, Tuning vLLM gpu_memory_utilization for the knobs (max_num_seqs, max_num_batched_tokens, chunked prefill) that shape the batch, Your HPA Scales LLM Pods on CPU for using the live batch-occupancy signal continuous batching produces as an autoscaling trigger, and You Changed the Prompt. Is the Model Better or Worse? for the behavioral-reliability layer that sits on top of this serving stack.
