kubectl drain Killed a 90-Second Inference Request. Stateless Drain Logic Doesn't Work for GPU Pods.
Draining a GPU node in the middle of a long inference request is how you teach your users what 503 looks like. A stateless pod evicts in seconds; a vLLM pod has a minute of cold start and requests in flight for two. This post covers the three things a production drain needs (a real grace period, a preStop that drains the engine, and a readiness gate that fails the instant drain starts) and why the same pattern is load-bearing for spot preemption, autoscaler downscaling, and rolling upgrades.
You drain a GPU node to patch the kernel. It is the same kubectl drain you have run a thousand times on web nodes, and it has never caused a problem. This time your on-call channel lights up: a wave of 503s, a batch of half-streamed responses that cut off mid-sentence, and a p99 latency graph that looks like a cliff. Ten minutes later it settles. You patched one node and caused a customer-visible incident.
The drain did exactly what it does for a stateless service, and that is the problem. A stateless web pod is disposable on a sub-second timescale: evict it, and the next request lands on another replica that was already warm. A GPU inference pod is not disposable like that. It has a 30-to-90-second cold start before it serves a single token, and it may be holding requests that take 60 to 120 seconds to finish — long-context generations, agent loops, multi-step structured output. Evict it the way you evict a web pod and you kill live work and pay a cold start to recover, all at once.
Most platform teams have never adapted their drain logic for this. They treat GPU pods like stateless web pods during maintenance, and they have a class of incidents that keeps happening every time a node cycles. This post covers the three things that make a drain graceful for inference, the exact pod spec that implements them, and why the same pattern is what saves you on spot preemption, autoscaler downscaling, and rolling fleet upgrades.
Why stateless drain logic breaks on inference pods
A drain is an eviction, and eviction was designed around an assumption: pods are cheap to move. Two properties make that true for a typical web service and false for inference.
In-flight requests are long. A stateless request completes in milliseconds, so the window where eviction can interrupt live work is vanishingly small. An LLM generation is open for seconds to minutes. At any given instant a busy inference pod is holding dozens of streams mid-decode. Evicting it does not interrupt a tiny window — it severs every connection in the active batch.
Replacement is slow. When you evict a web pod, the capacity it represented is effectively already elsewhere; another warm replica absorbs the load. When you evict an inference pod, the replacement pod is not ready. It pays the full startup tax first:
- Pull a large inference image (CUDA runtime, framework, often tens of GB).
- Fetch model weights (a 70B model in FP16 is ~140GB).
- Load weights into HBM, build CUDA graphs, run warmup passes.
That sequence — the same cold-start tax the LLM autoscaling post is built around — is 30 to 90 seconds on a good day, longer if weights come over the network. So a naive drain does not just drop the requests on the dying pod; it removes capacity that takes a minute-plus to rebuild, right when the load it was serving has to go somewhere.
Stateless eviction assumes two things that are false for inference: that in-flight work is short enough to ignore, and that replacement capacity is effectively instant. Inference requests run for minutes and replacement pods cold-start for a minute. A graceful drain has to respect both — let live requests finish and keep enough warm capacity that the ones you shed land somewhere ready. Drain logic tuned for stateless pods does neither.
The naive drain, step by step
Here is what kubectl drain $NODE does to a vLLM pod with default settings, and why each step hurts:
1. The API marks the pod for deletion and the eviction begins. With the default terminationGracePeriodSeconds: 30, the pod has 30 seconds, total, to disappear.
2. kubelet sends SIGTERM immediately (there is no preStop hook to run first). A well-behaved server begins graceful shutdown, but 30 seconds is not enough for a 90-second generation to finish, so when the grace period expires kubelet sends SIGKILL and the process dies with requests still open. Users on those streams see a connection reset, a 503, or a response that simply stops mid-token.
3. Meanwhile, the pod is still in the Service's Endpoints for a second or two after termination starts, because endpoint removal is asynchronous. New requests routed in that window hit a pod that is already shutting down: more 503s, this time for requests that never even got to start.
4. The replacement pod schedules elsewhere and cold-starts. For 30-90 seconds it consumes a GPU and serves nothing. The load that was on the drained pod queues at the load balancer behind a pod that is not ready, and p99 time-to-first-token spikes for everyone.
Two distinct failures are stacked here: live requests killed by SIGKILL (step 2), and new requests routed to a dying pod during the endpoint-convergence gap (step 3). A graceful drain has to close both, and they need different fixes.
What a production drain needs: three pieces
1. A grace period long enough to finish real work
The default terminationGracePeriodSeconds: 30 is the single most common cause of killed inference requests. Set it to comfortably exceed your longest realistic request:
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 180  # not the default 30
Pick the number from your own latency data — the p99 (or p99.9) of end-to-end request duration, plus headroom for the endpoint-convergence delay below. If your longest legitimate generations run 120 seconds, 180 gives margin. If you run agent loops that can stretch to several minutes, set it higher. The cost of an over-long grace period is small (a drained pod lingers a bit longer); the cost of one that is too short is killed requests on every drain.
terminationGracePeriodSeconds is the total termination window, and the clock starts when termination begins — the preStop hook runs inside it, not before it. If your preStop sleeps 10 seconds for endpoint convergence and your grace period is 180, the server has roughly 170 seconds after SIGTERM to drain, not 180. Size the grace period as preStop duration + longest in-flight request + margin, or your careful drain will still get SIGKILLed at the deadline.
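As a rough sketch of that budget, the arithmetic fits in two small shell helpers. The function names grace_period and drain_budget are made up for illustration, and the sample numbers are the ones used in this section, not measurements.

```shell
# Hypothetical helpers for sizing terminationGracePeriodSeconds.
# All values are whole seconds.

# grace_period PRESTOP LONGEST_REQUEST MARGIN
# Total window to configure. preStop runs inside it, so it is added in.
grace_period() {
  echo $(( $1 + $2 + $3 ))
}

# drain_budget GRACE PRESTOP
# Time the server has left after SIGTERM before kubelet sends SIGKILL.
drain_budget() {
  echo $(( $1 - $2 ))
}

grace_period 15 120 45   # prints 180
drain_budget 180 10      # prints 170
```

Feed it the p99 from your own latency data rather than these placeholder values.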
2. A preStop hook that drains the inference engine
The grace period gives the pod time; the preStop hook tells it what to do with that time. You want the engine to stop accepting new work and finish what it is holding, and you want new traffic to stop arriving before in-flight work winds down. A preStop hook runs before SIGTERM and blocks termination until it returns, which makes it the right place to sequence this.
The robust, framework-agnostic version does two things in order — flip the pod to unready, then wait out endpoint convergence:
lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - |
          # 1. Signal "draining" so the readiness probe starts failing
          #    and the LB removes us from rotation.
          touch /tmp/draining
          # 2. Wait for kube-proxy / EndpointSlice convergence so no new
          #    request is routed here after we stop being ready.
          sleep 15
          # SIGTERM follows automatically when preStop returns; the
          # server then finishes in-flight requests and exits.
After preStop returns, kubelet sends SIGTERM. Modern inference servers handle SIGTERM as graceful shutdown out of the box — vLLM's API server (uvicorn underneath) stops accepting new connections and lets in-flight requests complete before exiting, and SGLang behaves the same way. Because step 1 already took the pod out of rotation and step 2 waited for that to propagate, there are no new requests left to refuse — the only thing left when SIGTERM arrives is the in-flight batch, which finishes within the remaining grace period.
If your engine version does not drain cleanly on SIGTERM, or exposes an explicit "stop admitting, finish current, exit" control, call that endpoint in the hook instead of relying on signal handling — the sequence (go unready → converge → drain engine → exit) is what matters, not the specific mechanism.
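One way to make the hook wait on the engine rather than trust signal handling is to poll the in-flight count before letting preStop return. The sketch below assumes the engine exposes a Prometheus gauge named vllm:num_requests_running on /metrics; verify the metric name and port against the version you run. The helper names running_requests and wait_for_idle are hypothetical.

```shell
# Hypothetical preStop helpers. Assumes a Prometheus text endpoint that
# reports in-flight requests as the gauge vllm:num_requests_running.

# running_requests: read metrics text on stdin and print the summed gauge
# value as an integer (0 when the metric is absent).
running_requests() {
  awk '/^vllm:num_requests_running/ { sum += $NF } END { printf "%d\n", sum }'
}

# wait_for_idle DEADLINE_SECONDS FETCH_CMD
# Poll about once per second until nothing is running (exit 0) or the
# deadline passes (exit 1). An empty fetch result counts as idle.
wait_for_idle() {
  remaining=$1
  fetch=$2
  while [ "$remaining" -gt 0 ]; do
    if [ "$($fetch | running_requests)" -eq 0 ]; then
      return 0
    fi
    sleep 1
    remaining=$(( remaining - 1 ))
  done
  return 1
}

# Intended use inside the preStop script, after the convergence sleep:
#   touch /tmp/draining
#   sleep 15
#   wait_for_idle 150 "curl -sf http://localhost:8000/metrics"
```

Because the hook runs inside the grace period, the sleep plus the deadline passed to wait_for_idle has to stay below terminationGracePeriodSeconds, or kubelet will cut the wait short with SIGKILL.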
3. A readiness gate that fails the instant drain starts
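This is the piece most teams miss, and it is what closes the step-3 gap. The load balancer stops sending traffic to a pod when it sees the pod as unready, not when kubelet finishes terminating it. Kubernetes does mark a terminating pod as not ready in the Service's endpoints once deletion begins, but that update propagates asynchronously, and proxies or external load balancers that run their own health checks may lag behind it. If the pod's own readiness only flips at process exit, there is a window where the pod is terminating but still receiving traffic, and new requests pour into a shutting-down process.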
The fix is to make readiness depend on the same "draining" signal the preStop hook sets, so the pod reports itself unready the moment drain begins:
readinessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      # Unready the instant the drain marker exists; otherwise check
      # the engine's real health endpoint.
      - "! test -f /tmp/draining && curl -sf http://localhost:8000/health"
  periodSeconds: 2
  failureThreshold: 1
Now the sequence is tight: preStop touches /tmp/draining, the very next readiness check (within ~2 seconds) fails, the EndpointSlice controller removes the pod, and the sleep 15 in preStop covers the propagation lag before any request can be misrouted. Keep the readiness probe fast (periodSeconds: 2, failureThreshold: 1) so the unready transition is near-immediate — a slow probe reopens the gap you are trying to close.
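The probe's one-liner is easier to reason about when its two conditions are pulled apart. The sketch below restates the same logic as a shell function; ready_check is a hypothetical name, and true stands in for the real curl health call so the example runs without a server.

```shell
# ready_check MARKER HEALTH_CMD
# Exit 0 (ready) only when the drain marker file is absent AND the health
# command succeeds; any other combination exits non-zero (unready).
ready_check() {
  [ ! -f "$1" ] && $2
}

marker="${TMPDIR:-/tmp}/draining-demo.$$"
rm -f "$marker"

# Marker absent and health passing: the pod stays in rotation.
ready_check "$marker" true && echo "ready"

# Marker created by preStop: the next probe fails even though the
# engine itself is still healthy.
touch "$marker"
ready_check "$marker" true || echo "unready"

rm -f "$marker"
```

In the real probe, HEALTH_CMD would be the curl -sf http://localhost:8000/health call from the spec, and the marker would be /tmp/draining.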
This is a deliberately different concern from your liveness probe, and conflating the two is its own outage. The probes-done-wrong post covers why a liveness probe must never gate on load or drain state — here you want readiness to flip on drain while liveness keeps the process alive long enough to finish its work.
The full pod spec
Putting the three pieces together on a vLLM deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama
  namespace: inference
spec:
  replicas: 4
  selector:
    matchLabels:
      app: vllm-llama   # illustrative label; must match the template labels
  template:
    metadata:
      labels:
        app: vllm-llama
    spec:
      # 1. Real grace period: preStop sleep (15s) + longest request (~150s) + margin
      terminationGracePeriodSeconds: 180
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          ports:
            - containerPort: 8000
          # 3. Readiness fails the instant the drain marker appears
          readinessProbe:
            exec:
              command:
                - /bin/sh
                - -c
                - "! test -f /tmp/draining && curl -sf http://localhost:8000/health"
            periodSeconds: 2
            failureThreshold: 1
          # Liveness gates only on the process being alive, never on drain/load
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 10
            failureThreshold: 6   # tolerate a busy engine; do not self-kill
          # 2. preStop: go unready, wait for convergence, then let SIGTERM drain
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - -c
                  - "touch /tmp/draining && sleep 15"
          resources:
            limits:
              nvidia.com/gpu: "1"
The four-step drain it produces
With this spec, kubectl drain plays out cleanly:
1. Drain starts, preStop fires, pod marks itself unready (/tmp/draining created; next 2-second readiness check fails).
2. The Service removes the pod from rotation, with 1-2 seconds for the EndpointSlice controller and kube-proxy to converge, and the preStop sleep 15 covers that lag, so no new request is routed here.
3. The pod completes its in-flight batch. SIGTERM arrives after preStop returns; the engine refuses new connections (there are none) and finishes the requests it is holding, up to the remaining grace period.
4. Clean exit, or SIGKILL at the deadline. If everything finishes inside the grace period (the normal case), the process exits 0. The deadline is a backstop, not the expected path.
Done right, a drain produces zero user-visible errors. A handful of users wait one extra second while endpoints converge; nobody gets a 503 and nobody gets a truncated stream. The tell that your drain is correct is boring: you cycle a node during peak traffic and the error-rate graph does not move. If draining a node is ever an event worth watching, the drain logic is still stateless-shaped.
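One way to check that claim is to poll the Service while a node drains and count failures. The sketch below is a minimal version: summarize_statuses is a hypothetical helper, the URL in the comment is a placeholder, and a status-code poll only catches refused or failed requests, not streams truncated mid-response.

```shell
# summarize_statuses: read one HTTP status code per line on stdin and print
# "total=N errors=M". 5xx codes and 000 (what curl reports when it gets no
# HTTP response at all) count as errors.
summarize_statuses() {
  awk '
    { total++ }
    /^5[0-9][0-9]$/ || /^000$/ { errors++ }
    END { printf "total=%d errors=%d\n", total, errors }
  '
}

# Collect codes in one terminal while running kubectl drain in another
# (placeholder URL; stop the loop with Ctrl-C):
#   while sleep 1; do
#     curl -s -o /dev/null -w '%{http_code}\n' http://inference.example/health
#   done > codes.txt
#
# Then summarize:
#   summarize_statuses < codes.txt

printf '200\n200\n503\n000\n' | summarize_statuses   # prints total=4 errors=2
```

A correct drain should leave the errors count at zero for the whole window; anything else points back at the grace period, the preStop hook, or the readiness gate.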
Why this pattern is load-bearing far beyond manual drains
A manual kubectl drain for a kernel patch is the rare case. The same graceful-shutdown machinery is what protects you in three situations that happen constantly, often unattended:
Spot / preemptible GPU preemption. Spot GPUs are 60-70% cheaper, which is why the GPU cost optimization story leans on them — but the cloud can reclaim the node on 30 to 120 seconds of notice. That warning fires a node-termination signal; a handler (the AWS Node Termination Handler, GKE's graceful shutdown, or your own DaemonSet watching the metadata endpoint) cordons and drains the node, which triggers exactly the preStop → unready → finish sequence above. The catch is the budget: spot's warning window can be as short as 30 seconds, which may be less than your in-flight requests need. So on spot you also size requests against the warning window, checkpoint or cap long generations, and accept that the shortest-notice preemptions will truncate some work. Graceful drain turns "every preemption is an incident" into "most preemptions are invisible, the rare short-notice one sheds a little work."
Cluster autoscaler / Karpenter downscaling. When the node loop decides a GPU node is underused and removes it, it drains the node first. If your inference pods do not drain gracefully, every scale-down event during a lull becomes a burst of 503s — and on GPUs you scale down constantly to give back expensive idle capacity. The asymmetric, slow scale-down from the autoscaling post only pays off if the scale-down itself is graceful; otherwise you have traded a cost saving for a reliability cost.
Rolling upgrades of an inference fleet. Every time you roll a new model version, bump the engine, or change the pod spec, Kubernetes drains old pods to replace them. Without graceful drain, a routine deploy interrupts live generations across the whole fleet — the upgrade is a fleetwide drain. Pair the graceful shutdown here with a conservative PodDisruptionBudget and a sane maxUnavailable so you never drain more inference capacity at once than your warm headroom can absorb; the PDB-drain post covers the failure mode on the other side of that dial, where an over-strict PDB makes the drain hang forever instead.
In all three, the unit of safety is the same: a pod that, when asked to leave, stops taking new work, finishes what it holds, and exits clean — and a fleet that keeps enough warm capacity to absorb what the leaving pod was carrying.
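A rough way to size that headroom is to compare the replica count with the number of pods the current load needs. The sketch below uses made-up units (concurrent requests per pod) and hypothetical helper names; the result is a starting point for a PodDisruptionBudget or maxUnavailable setting, not a substitute for load testing.

```shell
# pods_needed LOAD PER_POD
# Pods required to carry LOAD when each pod handles PER_POD, rounded up.
pods_needed() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# max_disruptions REPLICAS LOAD PER_POD
# Warm pods beyond what the load needs: how many can drain at once
# without pushing work onto pods that are already full. Never negative.
max_disruptions() {
  needed=$(pods_needed "$2" "$3")
  spare=$(( $1 - needed ))
  if [ "$spare" -lt 0 ]; then
    spare=0
  fi
  echo "$spare"
}

max_disruptions 4 90 40    # 3 pods needed, prints 1
max_disruptions 4 160 40   # all 4 pods needed, prints 0
```

A result of 0 means any drain, however graceful per pod, pushes load onto pods that have no room for it.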
Common mistakes
Leaving terminationGracePeriodSeconds at the default 30. Shorter than a single long generation. Guarantees SIGKILLed requests on every drain. The first and most common bug.
No preStop hook, relying on SIGTERM alone. Even if the engine drains gracefully on SIGTERM, you still have the endpoint-convergence gap: new requests route to the pod for a second or two after termination starts. The preStop unready-then-wait is what closes that gap.
Readiness that only flips at exit. If the pod stays "ready" until the process dies, the load balancer keeps feeding it new requests through the entire shutdown. Readiness must fail the instant drain begins, not when it ends.
Forgetting preStop runs inside the grace period. A 60-second preStop sleep under a 60-second grace period leaves the server zero time to drain before SIGKILL. Budget preStop + longest request + margin into the grace period.
A liveness probe that kills the draining (or just busy) pod. If liveness gates on the same health endpoint under tight thresholds, a saturated engine fails liveness and kubelet restarts the pod mid-drain. Keep liveness loose and gating only on "process is alive"; see probes done wrong.
No warm headroom to catch shed load. A perfectly graceful drain still removes a pod's worth of capacity for the duration of a cold start. If you run at 100% utilization with no buffer, the requests the drained pod is no longer taking pile onto pods that are already full. Graceful drain and warm headroom are two halves of one pattern.
Draining too many pods at once. A rolling upgrade or an aggressive autoscaler can drain several inference pods simultaneously, blowing past what your headroom absorbs even with perfect per-pod drain. Bound concurrent disruption with a PodDisruptionBudget and conservative maxUnavailable.
Assuming spot warning ≥ request length. Spot preemption can give as little as 30 seconds, shorter than a long generation. Graceful drain handles the common case; for the short-notice tail you must cap or checkpoint long requests, not assume they will finish.
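The spot budget can be checked with the same kind of arithmetic as the grace period. This sketch assumes the termination handler reacts to the warning instantly, which real handlers do not, so treat the result as an upper bound; max_safe_request is a hypothetical helper.

```shell
# max_safe_request WARNING PRESTOP
# Longest in-flight request (seconds) that can still finish inside a
# preemption warning window once the preStop wait is subtracted.
# Prints 0 when preStop alone uses up the window.
max_safe_request() {
  left=$(( $1 - $2 ))
  if [ "$left" -lt 0 ]; then
    left=0
  fi
  echo "$left"
}

max_safe_request 120 15   # prints 105
max_safe_request 30 15    # prints 15
```

A cap on generation length, or a request timeout, set below this number keeps the short-notice tail from truncating work.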
The mental model
A stateless pod is a cattle-grade unit of capacity: identical, disposable, instantly replaceable, so eviction can be brusque and nobody notices. A GPU inference pod is neither instantly replaceable (it cold-starts for a minute) nor instantly disposable (it is holding minutes of live work), and the entire bug class comes from drain logic that still treats it like cattle.
Graceful drain restores the two properties the eviction model assumed. Disposable is restored by the preStop-plus-grace-period sequence: the pod stops taking new work, finishes what it holds, and leaves on its own terms instead of being killed mid-request. Replaceable is restored by warm headroom: enough spare capacity that the load the pod was carrying lands somewhere already running, instead of queueing behind a cold start. Get both and a drain is a non-event — which is the goal, because drains are not rare. Every kernel patch, every spot reclaim, every scale-down, every deploy is a drain. The question is never whether your GPU nodes will be drained; it is whether each drain is invisible or an incident.
Treat GPU pods like stateless web pods during maintenance and you have signed up for the second one, on a schedule set by your cloud provider and your release cadence. The fix is three small pieces of pod spec and a buffer you were probably already paying for.
The full GPU node lifecycle (graceful drain, spot preemption handling, cluster-autoscaler and Karpenter downscaling, multi-zone resilience, and rolling upgrades for inference fleets) is part of the Production GPU Infrastructure course. The spot-vs-on-demand economics that make preemption-safe draining worth the effort are covered in the GPU Cost Optimization course, and the post "Spot H100s Are 70% Cheaper. Most Teams Use Them Wrong." walks the cost model that this drain logic unlocks. The pod/node autoscaling loops that drain feeds into are covered in the LLM Inference on Kubernetes course, and the drain and eviction mechanics underneath it all in the Production Kubernetes Operations course. Related reading: "kubectl drain Has Been Running for 4 Hours" for the PodDisruptionBudget side of node maintenance, "Your HPA Scales LLM Pods on CPU" for the cold-start tax and the warm-buffer strategy that makes shed load land safely, and "The Six Probe Mistakes That Cause Real Outages" for getting readiness and liveness right so the drain gate works.