All posts
LLM Infrastructure

Your Team Is Debating vLLM vs SGLang. The Performance Numbers Are Not the Decision.

Both engines hit similar throughput on similar hardware in 2026. The decision is workload shape (agents vs chat vs RAG), structured output needs, and operational maturity. Here is the honest production comparison.

By Sharon Sahadevan··13 min read

You are picking the inference engine for the next generation of your LLM platform. Someone on the team has shipped vLLM since 2023 and wants to keep going. Someone else has been running SGLang for a structured-output workload and thinks it is the better default. A third person says you should just use TensorRT-LLM since you are on H100s anyway. You open a tab to find a benchmark that will settle it.

The benchmarks will not settle it. By 2026, the per-token throughput numbers between vLLM and SGLang on the same model and the same hardware are within 10 to 20 percent of each other. The order flips depending on workload shape, prompt distribution, and what features each release has shipped this quarter. If you pick based on a benchmark that was tuned for a workload that does not look like yours, you will deploy the wrong one.

The real decision is workload shape. Are your prompts repetitive or unique? Do you need structured output? Are you serving chat, RAG, or agents? How much operational maturity does your team have? Each of these pushes the decision in a specific direction, and none of them shows up cleanly on a throughput chart.

This post is the honest production comparison: where vLLM still wins, where SGLang has overtaken it, where the other engines fit, and the four-question decision framework that gets you the right answer in five minutes.

What both engines share#

Before the differences, the shared baseline. Any modern LLM inference engine in 2026 does the same six things, and vLLM and SGLang both do them well:

  • Continuous batching. New requests admitted at every forward-pass step; finished sequences evicted immediately. Standard since 2023.
  • Paged KV cache. Fixed-size KV pages, no fragmentation, fine-grained sharing across requests with common prefixes.
  • Tensor parallelism. Shard the model across N GPUs via NVLink/InfiniBand all-reduce.
  • Quantization. FP8 weights and KV cache, INT8, INT4 (AWQ, GPTQ). Both engines support the standard quantization formats.
  • Streaming. Token-by-token streaming via server-sent events or WebSocket.
  • OpenAI-compatible API. Both expose /v1/chat/completions and /v1/completions. Drop-in for clients that already speak OpenAI.

These are table stakes now. The differentiation is everywhere else.

The deeper foundations of continuous batching, KV cache, and PagedAttention are covered in Your LLM Cluster Is at 90% HBM and 60% Is KV Cache and the Tuning vLLM gpu_memory_utilization post; this one is about engine choice on top of those primitives.

Where vLLM wins#

vLLM has been the production default since 2023, and most of the reasons still hold in 2026.

1. Model coverage and time-to-support. When a new model family ships (Llama 4, Qwen 3.5, Mistral Large 2, the latest open frontier model), vLLM almost always has it within days. SGLang typically follows within one to four weeks. If you need to ship a model the day it is released, vLLM is the safer bet.

2. Hardware coverage. vLLM has first-class support for NVIDIA (H100, H200, B200), AMD MI300X, Intel Gaudi, AWS Inferentia/Trainium, Google TPU (via the experimental TPU backend). SGLang is NVIDIA-first and has been slower to expand. If your fleet is heterogeneous or includes AMD/Inferentia, vLLM wins by default.

3. Ecosystem maturity. vLLM integrates with Ray, KServe, Kubernetes operators (vllm-production-stack, KubeRay), Prometheus exporters, distributed-tracing instrumentation, and most observability stacks. SGLang has caught up on the basics but still has gaps in the operator ecosystem.

4. Multi-LoRA at production scale. Both engines support LoRA. vLLM's multi-LoRA serving (loading dozens of adapters on a single base model) is more mature, with better hot-reload and adapter-rotation behavior.

5. Distributed serving across nodes. vLLM's distributed inference (pipeline parallelism plus tensor parallelism across nodes) has been load-bearing for production deployments since 2024. SGLang has shipped similar features but with less production scar tissue.

6. The conservative-platform-team factor. If you have an SRE team that values "battle-tested" over "latest features," vLLM is the lower-risk choice. More people have hit the same bugs. More postmortems are public. More documented production patterns exist.

Where SGLang wins#

SGLang started as a research project from LMSYS and the Berkeley team behind vLLM (sister project, not a fork). It is the engine to choose when your workload has specific shapes that vLLM serves less well.

1. RadixAttention for prefix-heavy workloads. This is the biggest differentiator. RadixAttention is SGLang's automatic prefix cache, implemented as a radix tree across all in-flight and recently-finished sequences on the node. For workloads where 60 to 90 percent of input tokens are shared prefixes (system prompts, RAG context, agent loops), SGLang's per-node prefix hit rate is consistently higher than vLLM's, often by 2 to 3x. Lower TTFT, lower compute per request, better cost economics. If your traffic looks like a chatbot or an agent, SGLang's prefix caching alone can pay for the switch.

2. Structured output, by a wide margin. SGLang ships XGrammar (and similar constrained-decoding backends) with near-zero overhead. JSON-mode, regex grammars, context-free grammars, custom schemas, all run at the same throughput as unconstrained generation. vLLM has structured-output support (via outlines, lm-format-enforcer) but it is meaningfully slower in throughput-constrained scenarios. If you generate structured output (function calling, tool use, JSON-typed responses) at scale, SGLang is the right engine.

3. Agent workloads. Agents make many small LLM calls in sequence, with long shared prefixes that grow with each iteration. RadixAttention is the perfect substrate for this. The SGLang frontend DSL (sglang.function, sglang.gen, sglang.select) lets you express multi-call chains in a way that the runtime can optimize end-to-end (e.g., reusing KV across calls, parallelizing branches). This is rare in vLLM-based stacks; it requires more work in the application layer.

4. Speculative decoding maturity. Both engines support speculative decoding (drafting tokens with a small model, verifying with the big one). SGLang has shipped this earlier and more aggressively, with better integration with EAGLE-2 and Medusa. For latency-sensitive workloads where speculative decoding helps, SGLang is slightly ahead.

5. Lower TTFT under prefix-heavy load. A consequence of point 1, but worth calling out separately. If your SLO is TTFT-dominated and your traffic has prefix repetition, SGLang typically hits the SLO with fewer GPUs than vLLM.

KEY CONCEPT

The single sharpest differentiator: RadixAttention. If 60%+ of your input tokens are shared prefixes, SGLang's prefix cache hit rate beats vLLM's by 2 to 3x on a single node. That alone often justifies the switch for chatbot, RAG, and agent workloads. For unique-prompt workloads (search summarization, content generation from varied inputs), the gap disappears.

The performance question, in 2026 terms#

The benchmarks-on-Twitter culture wants this to be a clear winner. It is not.

For typical chat workloads on H100 with Llama-70B at FP16 or FP8, per-token throughput between vLLM and SGLang is within 10 to 20 percent. Which one leads depends on:

  • The release version of each engine (both ship every few weeks; rankings flip).
  • The prompt distribution (prefix-heavy favors SGLang; unique-prompt is roughly even).
  • Whether structured output is in the mix (heavy favor to SGLang).
  • The tensor parallelism degree (both are competitive; minor differences at TP=8+).
  • The quantization format (FP8 vs INT8 vs AWQ; both engines support all of them, perf parity).

If you publish your own benchmark, here is what to actually measure:

  • TTFT p50, p95, p99 under realistic concurrency. Not at zero load; the tail is what matters.
  • Per-token decode latency (ITL) at your target concurrency.
  • End-to-end latency for your specific request mix. Synthetic prompts mislead.
  • Throughput at your latency SLO, not the engine's max throughput at unconstrained latency.
  • Prefix-cache hit rate on a realistic prompt distribution. If you do not measure this, you are not benchmarking the relevant axis.

A team that benchmarks against the wrong axis will pick the wrong engine. A team that does not benchmark at all will pick whichever one their lead engineer used last time. Both are common failure modes.

What about TensorRT-LLM, TGI, and the others?#

TensorRT-LLM (NVIDIA). The highest single-GPU throughput on NVIDIA hardware, often 20 to 40 percent above vLLM or SGLang for the same model and the same precision. Locked to NVIDIA. Complex to operate (custom engine builds per model and per precision, not a "just install" experience). NVIDIA's trtllm-serve has made it more accessible than it was in 2024, but the operational surface is still meaningfully larger than vLLM's. Reach for it when: you are 100% NVIDIA, you have a platform team that can operate it, and the throughput delta covers a meaningful GPU cost. Skip it when: you want a default, you have heterogeneous hardware, or your team is small.

TGI (HuggingFace). The conservative production option. Less feature velocity than vLLM or SGLang, but stable. Tight HuggingFace ecosystem integration. Reach for it when: you are deeply in the HuggingFace stack and prefer slow-and-steady releases. Skip it when: you need newer models fast, or you want the strongest community.

LMDeploy. Strong in China-heavy stacks, decent FP8 support, less common in Western production. Reach for it when: you are deploying Chinese open-source models (DeepSeek, Qwen, GLM) and want closer integration with the upstream community. Skip it when: your ecosystem is OpenAI-API-compatible and your team is Western-based.

Mooncake (Moonshot AI). Not really an inference engine, more a KV cache architecture you put behind vLLM or SGLang. Covered in Your LLM Cluster Is at 90% HBM and 60% Is KV Cache. Reach for it when: you are at the KV cache wall and need cluster-wide disaggregated cache.

The next-generation entrants. A handful of newer engines (vLLM's own next-gen scheduler rewrite, NVIDIA's Dynamo, others) are emerging. Worth watching, not yet worth defaulting to in 2026.

The four-question decision framework#

Five-minute decision. Answer these four questions; the engine follows.

Q1: What is your workload shape?

  • Chatbot, RAG, or agent (prefix-heavy, > 60% shared input tokens): SGLang.
  • Unique prompts (search summarization, content generation, varied inputs): vLLM or SGLang. Lean vLLM for the ecosystem.
  • Mixed: Start vLLM. Profile. Revisit if prefix hit rate matters.

Q2: Do you generate structured output at scale?

  • Yes (function calling, tool use, JSON responses on the hot path): SGLang.
  • Occasionally: Either. vLLM's outlines integration is adequate.
  • No: Either.

Q3: What is your hardware fleet?

  • NVIDIA only, with a platform team that can operate it, throughput-critical: TensorRT-LLM.
  • NVIDIA only, default case: vLLM or SGLang per Q1/Q2.
  • Heterogeneous (NVIDIA + AMD + Inferentia + others): vLLM.

Q4: What is your team's operational maturity?

  • Small team, prefers conservative defaults: vLLM. Less risk of surprises.
  • Mature platform team, comfortable with newer projects: SGLang where it fits.
  • Multi-engine acceptable: Run vLLM as the default fleet, SGLang as a specialized pool for agent/RAG workloads. This is increasingly common.

The honest summary: vLLM remains the right default for most teams. SGLang is the right choice for specific workload shapes (prefix-heavy, structured output, agents) and is overtaking vLLM in those use cases. TensorRT-LLM is the right choice for throughput-critical NVIDIA-only workloads with a team to operate it. The other engines are niches.

WAR STORY

A team I worked with was running vLLM for all their LLM workloads on H100. The platform was four months old and ran a chat product, a RAG product, and an internal agent system. GPU utilization was 70%, but GPU bill was alarming. They benchmarked SGLang for the agent workload alone. RadixAttention pushed prefix hit rate from 18 percent (vLLM, no sticky routing) to 71 percent (SGLang, default config). TTFT dropped from 1.4 seconds to 380 ms. Required GPU count for the agent pool went from 12 to 5. They kept vLLM for chat and RAG, ran SGLang as a dedicated pool for agents. Total GPU bill dropped 28 percent. Lesson: do not let "we already run X" prevent you from running Y for a specific workload that needs Y. Multi-engine is fine when the operational cost is worth the savings.

Common mistakes#

  • Benchmarking on synthetic prompts. Real traffic has prefix repetition, varying lengths, and bursty arrival. Synthetic benchmarks mislead you toward the engine that wins synthetic tests, which may not be the one that wins on your traffic.
  • Optimizing for max throughput at unconstrained latency. Max throughput is irrelevant if it requires p99 TTFT outside your SLO. Benchmark at the latency you actually need.
  • Picking by the latest Twitter benchmark. Both engines ship every few weeks. Rankings flip. The benchmark you saw last quarter is already out of date.
  • Single-engine fundamentalism. Multi-engine deployments (vLLM for general workloads, SGLang for agents/structured output) are increasingly common and increasingly cheap to operate. The "one engine to rule them all" instinct often costs more than it saves.
  • Ignoring structured-output overhead. vLLM with constrained decoding is meaningfully slower than vLLM without. If structured output is hot-path, that gap is part of your throughput math.
  • Underestimating switching cost. Both engines are OpenAI-API-compatible, but operational tooling (metrics, dashboards, autoscaling, deployment manifests) is engine-specific. Plan a quarter for a switch, not a sprint.
  • Skipping prefix-hit-rate metrics. If your engine does not expose prefix cache hit rate as a first-class metric, you are flying blind on the most important cost lever you have. Both vLLM and SGLang expose it; instrument it.

The mental model#

In 2023, picking the LLM inference engine was easy because vLLM was meaningfully better than anything else open-source. In 2026, it is a real decision with no universally right answer. SGLang has matured to the point where it wins specific use cases by wide margins. TensorRT-LLM has narrowed the gap on operational complexity. The "default vLLM" answer is still right for most teams, but the "always vLLM" answer is wrong.

The right framing is workload-first, not engine-first. Pick the engine that matches the shape of your traffic. Run more than one if the shapes diverge enough to justify it. Re-evaluate every six months, because the projects ship fast enough that the rankings change.

If you are starting greenfield today: vLLM as the default unless Q1 or Q2 in the framework above pushes you to SGLang. If you have an existing vLLM deployment and a new workload shape comes along: pilot SGLang for that workload before scaling vLLM. If you are NVIDIA-only with a serious throughput SLO and a team to operate it: benchmark TensorRT-LLM seriously, not as a fallback.

The engine choice is not the whole story. The KV cache architecture beneath it (Mooncake-style disaggregation), the tuning of the engine itself (vLLM gpu_memory_utilization), and the underlying GPU memory model (fragmentation) all matter at least as much. Picking the right engine and operating it poorly is worse than picking either engine and operating it well.


The full LLM serving architecture, including continuous batching, KV cache, prefix caching at scale, hallucination detection, and the FAANG-level interview framing of inference design, is covered in the LLM Operations for MLOps Engineers course. The Kubernetes-specific deployment patterns (operators, autoscaling, multi-LoRA, fleet routing) are part of the LLM Inference on Kubernetes course. The GPU foundations beneath all of this are the Production GPU Infrastructure course. Related reading: MIG vs Time-Slicing on Kubernetes for the GPU-partitioning decision that often paired with engine choice, Prompt Economics for the token cost model that decides whether the throughput you win actually shows up in the bill, and Your GPU Finishes a Request and Waits for the Slowest for the continuous-batching scheduler both engines are built on and why it is the real source of their throughput.

More in LLM Infrastructure

LLM Infrastructure··13 min read

Your GPU Finishes a Request and Waits for the Slowest. Continuous Batching Is the Fix.

Static batching pads every request to the length of the longest one in the batch. Short requests finish and their GPU slots sit idle, burning money, until the whole batch drains. Continuous batching schedules at the granularity of a single token instead of a whole request — and it is the single biggest reason vLLM is 5x faster than naive serving. Here is exactly how it works.

Read post
LLM Infrastructure··15 min read

Your HPA Scales LLM Pods on CPU. They're Either Idle or On Fire.

The default Kubernetes autoscaler watches CPU. Your GPU sits at 100% no matter what. So your inference fleet either never scales or scales 90 seconds too late, after the cold start, after the SLO already broke. The signals that actually predict load, the KEDA wiring, and the cold-start tax that makes reactive scaling a trap.

Read post
LLM Infrastructure··16 min read

Your LLM Bill Tripled and Traffic Didn't. Welcome to Prompt Economics.

The unit of cost in an LLM system is the token, and almost nobody is counting them. Output tokens cost 3-5x input. Your context window is 80% dead weight. This is the cost-per-request math, where the tokens actually go, and the levers that bend the curve — in ROI order.

Read post