Your LLM Bill Tripled and Traffic Didn't. Welcome to Prompt Economics.

The unit of cost in an LLM system is the token, and almost nobody is counting them. Output tokens cost 3-5x input. Your context window is 80% dead weight. This is the cost-per-request math, where the tokens actually go, and the levers that bend the curve — in ROI order.

By Sharon Sahadevan · 16 min read

The CFO forwards you a graph. The LLM line item went from $40K a month to $130K over the last quarter. In the same window, request volume grew 22%. The question in the email is short: "Why is the bill up 3x when traffic is up a fifth?"

You pull the provider dashboard. It shows requests, latency, error rate. It does not show the thing that actually drives the bill. So you start instrumenting, and a week later you have the answer. Traffic is up 22%. Tokens are up 280%. Somewhere in the last quarter, three teams shipped features that each quietly tripled the size of the prompt: a bigger system prompt, more RAG chunks, longer chat history retained per conversation. Nobody was watching tokens, because nobody owned tokens.

This is prompt economics. The unit of cost in an LLM system is not the request and not the GPU-hour. It is the token. Requests are what your product team counts. Tokens are what you pay for. The gap between those two numbers is where the entire LLM cost problem lives, and most teams cannot even see it because their dashboards are built around requests.

This post is the cost model: how a request turns into dollars, why the two halves of a request are priced so differently, where the tokens actually go, and the levers that bend the curve — in the order you should pull them.

The unit of cost is the token, and the two halves are not equal

Every LLM request has two token counts, and they are not priced the same.

  • Input tokens (prefill). Everything you send: system prompt, context, history, the user's message. Processed in one parallel forward pass. Cheap per token.
  • Output tokens (decode). Everything the model generates. Produced one token at a time, each step a full forward pass that reads the entire KV cache. Expensive per token.

Across the major providers in 2026, output tokens cost roughly 3x to 5x input tokens. That ratio is not arbitrary pricing — it reflects the hardware. Prefill is compute-bound and parallel: you can push 2K input tokens through one batched forward pass and saturate the GPU. Decode is memory-bandwidth-bound and sequential: generating 500 output tokens means 500 forward passes, each one re-reading the growing KV cache out of HBM. Decode is where the GPU spends its time, so decode is where the money goes.

This single fact reorders every optimization you will ever consider. The instinct is to trim the giant system prompt. Sometimes that is correct. But if your workload generates long outputs, the math can flip: at a 5x multiplier, the 200 output tokens you didn't need cost more than the 800 input tokens you did. Count both, but weight output by the price multiplier.

KEY CONCEPT

Output tokens cost 3-5x input tokens because decode is sequential and memory-bandwidth-bound while prefill is parallel and compute-bound. When you hunt for cost, a generated token is worth several input tokens. "Make the model more concise" is often a bigger lever than "shorten the prompt" — and almost nobody measures output length.
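One way to apply that weighting is to convert both halves of a request into input-equivalent tokens. This is a minimal sketch, not a provider formula: `input_equivalent_tokens` is a hypothetical helper, and the 3x and 5x multipliers are just the two ends of the range quoted above.

```python
def input_equivalent_tokens(input_tokens, output_tokens, output_multiplier):
    """Express a request's billable weight in units of one input token."""
    return input_tokens + output_tokens * output_multiplier

# Compare 200 unneeded output tokens against an 800-token prompt
# at both ends of the assumed 3x-5x price range.
for multiplier in (3.0, 5.0):
    weight = input_equivalent_tokens(0, 200, multiplier)
    print(f"{multiplier}x: 200 output tokens weigh as much as {weight:.0f} input tokens")
```

At 5x the wasted output outweighs the 800-token prompt; at 3x it does not. The multiplier for your specific model belongs in the calculation, not a rule of thumb.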

The cost-per-request formula

Here is the whole model. There is no more to it than this:

cost_per_request
  = (input_tokens  / 1_000_000) * input_price_per_M
  + (output_tokens / 1_000_000) * output_price_per_M

monthly_cost = cost_per_request * requests_per_month

Plug in real numbers. Take a mid-tier model at $3 per million input tokens and $15 per million output tokens — a 5x ratio, typical for 2026.

INPUT_PRICE  = 3.00   # $ per 1M input tokens
OUTPUT_PRICE = 15.00  # $ per 1M output tokens

def cost_per_request(input_tokens, output_tokens):
    return (input_tokens / 1e6) * INPUT_PRICE + (output_tokens / 1e6) * OUTPUT_PRICE

# A "small" RAG chatbot request that feels cheap:
#   system prompt 1,500 + 6 RAG chunks @ 500 = 3,000 + history 2,000 + user 200
#   output 400
inp = 1500 + 3000 + 2000 + 200   # 6,700 input tokens
out = 400                         # 400 output tokens

print(cost_per_request(inp, out))     # $0.0261 per request
print(cost_per_request(inp, out) * 5_000_000)  # 5M req/mo -> $130,500

$0.026 feels like nothing. Multiply by five million requests a month and it is the email from the CFO. No single request is expensive. The cost is always in the multiplication. This is why "it's just a few cents" reasoning destroys LLM budgets: the per-request number is below the threshold where a human's intuition fires, and the monthly number is above the threshold where the CFO's does.

Notice the shape of that example: 6,700 input tokens, 400 output. The request is 94% input by volume. But at a 5x output multiplier, the 400 output tokens cost $0.006 and the 6,700 input tokens cost $0.020 — output is 23% of the cost on just 6% of the tokens. Both halves matter, and you cannot reason about either without measuring both. Token counting per request, broken out by input and output, is the single instrument the dashboard is missing.

Where the tokens actually go

Pull apart the input side of a typical production request and it looks like this. The user's actual message — the only part anyone thinks about — is usually the smallest slice.

Per-request input token budget (typical RAG chatbot)
+----------------------------------------------------------+
| System prompt + tool definitions      1,500   ~22%       |  reused every request, verbatim
+----------------------------------------------------------+
| Retrieved context (6 chunks @ 500)     3,000   ~45%       |  top-k often set by vibes
+----------------------------------------------------------+
| Conversation history (8 turns)         2,000   ~30%       |  grows unbounded if you let it
+----------------------------------------------------------+
| The user's actual question              200    ~3%        |  the only part the user typed
+----------------------------------------------------------+
  total input ~6,700 tokens

Three of those four blocks are dead weight you control:

  • System prompts and tool definitions. Identical on every request. A 1,500-token system prompt sent on 5M requests is 7.5 billion input tokens a month, every one of them the same bytes. This is what prompt caching exists to kill (next section).
  • RAG context. top_k is the most common unexamined cost driver in production. Someone set it to 8 during a demo because 8 felt safe, and it shipped. Each extra chunk is ~500 input tokens on every request, forever. Half the time the chunks past rank 3 are noise that also hurts answer quality.
  • Conversation history. If you replay the full transcript every turn, per-turn cost grows linearly and cumulative cost grows quadratically with conversation length: turn 20 carries turns 1 through 19. Unbounded history is the classic "why did this one power user cost $400 this month" bug.

The user's question is 3% of the input. Optimizing the prompt the user typed is optimizing the one part you cannot control and that costs almost nothing. The money is in the boilerplate.
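The history blowup is easy to see with a few lines of arithmetic. The sketch below assumes every turn adds a flat 250 tokens (the 8 turns at 2,000 tokens from the budget above) and compares full replay against a hypothetical 8-turn window; `cumulative_history_tokens` is an illustrative helper, not part of any library.

```python
from typing import Optional

def cumulative_history_tokens(turns: int, tokens_per_turn: int,
                              window: Optional[int] = None) -> int:
    """Total history tokens sent over a whole conversation.

    On turn t the request carries the previous t-1 turns,
    or at most `window` of them when a cap is set.
    """
    total = 0
    for t in range(1, turns + 1):
        carried = t - 1 if window is None else min(t - 1, window)
        total += carried * tokens_per_turn
    return total

for n in (10, 20, 40):
    full = cumulative_history_tokens(n, 250)
    capped = cumulative_history_tokens(n, 250, window=8)
    print(f"{n} turns: full replay {full:,} tokens, 8-turn window {capped:,} tokens")
```

Doubling the conversation from 20 to 40 turns roughly quadruples the replayed history under full replay, while the windowed version grows linearly.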

The cost levers, in ROI order

There are five levers that actually move the bill. Pull them in this order — cheapest and highest-return first.

1. Prompt caching (the biggest lever, by far)

If a prefix of your prompt is identical across requests — and for system prompts and tool definitions it is byte-for-byte identical — you should be paying for it once, not on every request. Prompt caching is the mechanism. The provider keeps the KV cache for a marked prefix warm and charges cached input tokens at a steep discount, typically 10% of the normal input price (a 90% saving on that segment). Self-hosted, this is exactly the prefix-caching story from the KV cache wall: the same KV pages, reused instead of recomputed.

Worked example on the request above. Of the 6,700 input tokens, the 1,500-token block of system prompt and tool definitions is static. Cache it:

CACHED_INPUT_PRICE = 0.30   # cached input often ~10% of standard input

def cost_with_cache(cached_in, fresh_in, output):
    return (cached_in  / 1e6) * CACHED_INPUT_PRICE \
         + (fresh_in   / 1e6) * INPUT_PRICE \
         + (output     / 1e6) * OUTPUT_PRICE

# 1,500 cached system prompt, 5,200 fresh (RAG + history + user), 400 output
print(cost_with_cache(1500, 5200, 400))   # $0.02205 vs $0.0261  -> ~15.5% off

# Now imagine a workload with a HEAVY shared prefix:
#   8,000-token system + few-shot examples, 300 fresh, 200 output
print(cost_per_request(8300, 200))            # $0.0279 uncached
print(cost_with_cache(8000, 300, 200))        # $0.0063 cached  -> ~77% off

The savings scale with how much of your prompt is shared. For a thin system prompt it is a modest win. For the increasingly common pattern of a huge static preamble (detailed instructions, tool schemas, 10-shot examples) followed by a tiny dynamic tail, prompt caching is a 50-75% cut on input cost for one config change and a stable prompt ordering.
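To see how the saving scales, you can express it as a function of the cached share of the input. This is a rough sketch under simplifying assumptions: every cached token is a hit, cached tokens bill at 10% of the input price, and any cache-write surcharge or expiry your provider applies is ignored. `input_savings_from_caching` is a hypothetical helper.

```python
def input_savings_from_caching(cached_fraction, cached_price_ratio=0.10):
    """Fraction of input spend removed when `cached_fraction` of the
    input tokens are billed at `cached_price_ratio` of the normal price."""
    return cached_fraction * (1 - cached_price_ratio)

thin = input_savings_from_caching(1500 / 6700)   # thin system prompt from the RAG example
heavy = input_savings_from_caching(8000 / 8300)  # heavy shared prefix example
print(f"thin prefix:  {thin:.0%} off input spend")
print(f"heavy prefix: {heavy:.0%} off input spend")
```

These figures are for the input side only. Output tokens are untouched by caching, so the cut on the whole request is smaller.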

PRO TIP

Caching only works on an exact prefix match, so prompt ordering is now a cost decision. Put everything static first — system prompt, tool definitions, few-shot examples — and everything dynamic last — retrieved chunks, user message. A single moving token near the front (a timestamp, a request ID, a reordered tool list) busts the cache for everything after it. Freeze the prefix.
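A quick way to catch a moving prefix is to diff two consecutive prompts and see how far they agree. The sketch below is an approximation: real caches match on tokens, not characters, and the prompt layout, `STATIC_PREFIX`, and both builder functions are made up for illustration.

```python
import os.path

# Hypothetical static prefix: system prompt and tool definitions, identical on every request.
STATIC_PREFIX = "SYSTEM: You are a support assistant.\nTOOLS: search, lookup\n"

def build_static_first(chunks, user_message):
    """Static content first, dynamic content last: the prefix never changes."""
    return STATIC_PREFIX + "\n".join(chunks) + "\nUSER: " + user_message

def build_timestamp_first(timestamp, chunks, user_message):
    """Anti-pattern: a moving value placed ahead of the static content."""
    return f"TIME: {timestamp}\n" + build_static_first(chunks, user_message)

def shared_prefix_chars(a, b):
    """Length of the common leading run of characters between two prompts."""
    return len(os.path.commonprefix([a, b]))

good = shared_prefix_chars(
    build_static_first(["chunk A"], "reset my password"),
    build_static_first(["chunk B"], "cancel my order"),
)
bad = shared_prefix_chars(
    build_timestamp_first("2026-01-01T00:00:01", ["chunk A"], "reset my password"),
    build_timestamp_first("2026-01-01T00:00:02", ["chunk A"], "reset my password"),
)
print(f"static first:    {good} shared chars (static prefix is {len(STATIC_PREFIX)})")
print(f"timestamp first: {bad} shared chars")
```

With the timestamp in front, two otherwise identical prompts stop matching before the static block is even complete. A check like this in CI is one cheap way to keep the prefix frozen.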

2. Output token control (the lever nobody measures)

Because output is the 3-5x-priced half, controlling it has outsized return — and almost no one instruments it. Three concrete moves:

  • Set max_tokens deliberately. Leaving it at the model default invites the model to ramble to the ceiling on the occasional request. The tail of your output-length distribution is real money.
  • Ask for less. "Answer in 2-3 sentences" or "return only the JSON" measurably shortens output. For structured extraction, constrained/structured output stops the model from wrapping the answer in 200 tokens of "Certainly! Here is the information you requested..."
  • Stop sequences. If you only need the first field, stop after it. You pay for tokens you generate, including the ones you throw away.

A workload that averages 600 output tokens and could do the job in 250 is paying ~2.4x on the expensive half of every request. That is frequently a larger, easier win than anything on the input side, and it is invisible until you start logging output length per request.
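Once output length is logged, the tail is easy to price. The following sketch summarizes a list of per-request output lengths and the spend sitting above a candidate max_tokens cap. The sample distribution is invented, `output_length_report` is a hypothetical helper, and the $15 per million price is the same illustrative figure used earlier.

```python
import statistics

OUTPUT_PRICE = 15.00  # $ per 1M output tokens (illustrative)

def output_length_report(output_lengths, cap):
    """Summarize output lengths and the tokens generated beyond `cap`."""
    ordered = sorted(output_lengths)
    # Nearest-rank style p95, clamped to the last index for short lists.
    p95 = ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]
    tokens_over_cap = sum(max(0, n - cap) for n in ordered)
    return {
        "mean": statistics.mean(ordered),
        "p95": p95,
        "tokens_over_cap": tokens_over_cap,
        "dollars_over_cap": tokens_over_cap / 1e6 * OUTPUT_PRICE,
    }

# Invented sample: mostly short answers, a few long ones, two runaways.
sample = [200] * 90 + [600] * 8 + [4000] * 2
report = output_length_report(sample, cap=800)
print(report)
```

A cap truncates output rather than making the answer shorter, so pair it with an instruction to be concise and check that the capped requests still complete the task.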

3. Context pruning: RAG top-k and history windows

Attack the two input blocks that grow without anyone deciding they should.

  • Tune top_k. Measure answer quality as a function of retrieved chunks. Many RAG systems plateau at 3-4 chunks; everything past that is cost and often noise. Going from top_k=6 to top_k=3 on the example above removes 1,500 input tokens per request, a ~22% input cut that frequently improves accuracy by not burying the relevant chunk.
  • Bound conversation history. Cap retained turns, or summarize older turns into a short running summary. This converts quadratic transcript growth into a bounded window and kills the long-conversation cost blowup.
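Bounding history can be as simple as keeping the newest turns that fit a token budget. This is a minimal sketch: `estimate_tokens` is a crude stand-in (about 4 characters per token) for your model's real tokenizer, and `bound_history` is a hypothetical helper that drops old turns rather than summarizing them.

```python
def estimate_tokens(text):
    """Crude stand-in for a real tokenizer: roughly 4 characters per token."""
    return max(1, len(text) // 4)

def bound_history(turns, budget_tokens):
    """Keep the most recent turns that fit in the budget; oldest are dropped first."""
    kept = []
    used = 0
    for turn in reversed(turns):
        cost = estimate_tokens(turn)
        if used + cost > budget_tokens:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept

# Four turns of about 100 estimated tokens each, with a 250-token budget.
history = ["a" * 400, "b" * 400, "c" * 400, "d" * 400]
print(len(bound_history(history, budget_tokens=250)))
```

Only the last two turns survive a 250-token budget here. Replacing the dropped turns with a short running summary is the natural next step when older context still matters.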

4. Batching and throughput (self-hosted only)

If you run your own inference, cost per token is GPU-hours divided by tokens served. The lever is utilization: continuous batching (vLLM, SGLang) packs many concurrent decodes into each forward pass so the GPU is doing useful work instead of waiting on one sequence. A self-hosted stack at batch size 1 can be an order of magnitude more expensive per token than the same hardware saturated. This is downstream of tuning gpu_memory_utilization and of GPU memory fragmentation — fragmented HBM caps the batch size you can actually run, which caps tokens-per-dollar.
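The GPU-hours-over-tokens relationship fits in one function. The numbers below are placeholders chosen to show the shape of the curve, not benchmarks: a $3/hour GPU, 50 tokens/second at batch size 1, and 1,500 tokens/second when saturated are all assumptions.

```python
def self_hosted_cost_per_million(gpu_dollars_per_hour, tokens_per_second):
    """Dollars per 1M tokens served, given GPU price and sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1e6

# Placeholder figures for illustration only.
unbatched = self_hosted_cost_per_million(3.0, 50)
saturated = self_hosted_cost_per_million(3.0, 1500)
print(f"batch size 1: ${unbatched:.2f} per 1M tokens")
print(f"saturated:    ${saturated:.2f} per 1M tokens")
```

The hourly price is the same in both cases. The per-token price moves only because throughput does.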

5. Model right-sizing and routing

The most expensive token is one generated by a frontier model for a task a small model would have nailed. Two patterns:

  • Right-size per task. Classification, extraction, routing, and simple Q&A rarely need your largest model. Match model tier to task difficulty instead of defaulting everything to the flagship.
  • Cascade / route. Send every request to a cheap small model first; escalate to the expensive model only when the small one fails a confidence or validation check. For workloads where most requests are easy, a cascade can cut cost 60-80% while keeping flagship quality on the hard tail. The routing logic is the new cost-critical infrastructure — see vLLM vs SGLang for the engine layer this sits on.

WARNING

Do not start with model routing. It is the lever teams reach for first because it feels like the big one, but it adds a classifier, a fallback path, and a whole new evaluation surface — real engineering and real risk of quality regressions. Prompt caching and output control are config changes with no quality downside. Pull the cheap levers first; reach for routing only when caching, output limits, and context pruning are already in place and the bill is still too high.
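When you do get to a cascade, the expected cost is simple to model. In this sketch every request pays for the small model and an escalated fraction also pays for the large one. The $0.002 small-model cost and the 20% escalation rate are assumptions for illustration; the $0.0261 flagship cost is the RAG example from earlier.

```python
def cascade_cost(small_cost, large_cost, escalation_rate):
    """Expected cost per request: always the small model,
    plus the large model on the escalated fraction."""
    return small_cost + escalation_rate * large_cost

def cascade_savings(small_cost, large_cost, escalation_rate):
    """Fraction saved versus sending everything to the large model."""
    return 1 - cascade_cost(small_cost, large_cost, escalation_rate) / large_cost

expected = cascade_cost(0.002, 0.0261, 0.20)
saved = cascade_savings(0.002, 0.0261, 0.20)
print(f"expected cost per request: ${expected:.5f}, saving {saved:.0%}")
```

The saving disappears as the escalation rate approaches 1 - small_cost / large_cost, so the quality of the confidence check matters as much as the price gap between models.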

Self-hosted vs API: the same math, a different curve

The token model is identical whether you pay a provider per token or run your own GPUs — only the price-per-token changes. The decision is a break-even.

  • API: price per token is fixed and public, you pay only for what you use, zero idle cost, and the provider handles caching/batching/scaling. Cost scales linearly with usage forever.
  • Self-hosted: price per token is GPU-hours / tokens-served, dominated by utilization. High fixed cost (the GPUs run whether requests come or not), but at high, steady volume the per-token cost drops below API pricing.

The crossover is mostly about utilization. A self-hosted H100 that sits 30% idle is burning money the API model never charges you for; the same H100 saturated with continuous batching can beat API pricing several times over. The honest version of "should we self-host to save money" is "can we keep the GPUs busy" — which is a GPU cost-optimization problem, not a prompt problem. Most teams should ride APIs (and harvest the cached-input discount) until volume is high and steady enough that utilization math flips.
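The break-even itself is a two-line calculation. The figures here are assumptions, not quotes: a GPU costing $2,200 a month all-in and an API blended price of $4 per million tokens. `break_even_tokens_per_month` is a hypothetical helper.

```python
SECONDS_PER_MONTH = 730 * 3600  # about 730 hours in an average month

def break_even_tokens_per_month(gpu_monthly_cost, api_price_per_million):
    """Monthly token volume at which the fixed GPU cost equals the API bill."""
    return gpu_monthly_cost / api_price_per_million * 1e6

# Assumed figures for illustration only.
volume = break_even_tokens_per_month(2200.0, 4.0)
required_tps = volume / SECONDS_PER_MONTH
print(f"break-even: {volume:,.0f} tokens/month")
print(f"needs a sustained average of about {required_tps:.0f} tokens/second")
```

The second number is the honest test. If your traffic cannot keep the GPU at that sustained average around the clock, the API is still cheaper.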

You cannot optimize what you do not measure

Every lever above depends on instrumentation the default dashboard does not give you. Before optimizing, log per request:

  • input tokens, output tokens (separately — they have different prices)
  • cached vs fresh input tokens (your cache hit rate, in dollars)
  • model used (for routing and right-sizing analysis)
  • a feature/tenant tag (so you can attribute cost to the team that shipped the 8-chunk RAG)

Then track two derived numbers:

  • Cost per request, sliced by feature and tenant. Averages lie — one feature at 99% cache hit and one at 5% average out to a number that describes neither.
  • Cost per successful task. The metric that actually matters. A cascade that retries failed requests can have a higher cost-per-request but a lower cost-per-successful-task, because it stops paying repeatedly for wrong answers. Optimizing cost per request alone can quietly make your product worse.
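The logging and the two derived numbers above can be sketched in a few dozen lines. Everything here is illustrative: `UsageRecord`, its field names, and `cost_by_feature` are hypothetical, and the prices are the example figures used throughout this post. In practice the token counts come from the usage data your provider or inference engine returns with each response.

```python
from collections import defaultdict
from dataclasses import dataclass

# $ per 1M tokens, illustrative prices from the examples above.
INPUT_PRICE, CACHED_INPUT_PRICE, OUTPUT_PRICE = 3.00, 0.30, 15.00

@dataclass
class UsageRecord:
    feature: str              # feature or tenant tag for attribution
    model: str
    fresh_input_tokens: int
    cached_input_tokens: int
    output_tokens: int
    success: bool             # did the request complete the task?

    @property
    def cost(self) -> float:
        return (self.fresh_input_tokens * INPUT_PRICE
                + self.cached_input_tokens * CACHED_INPUT_PRICE
                + self.output_tokens * OUTPUT_PRICE) / 1e6

def cost_by_feature(records):
    """Cost per request and cost per successful task, sliced by feature."""
    totals = defaultdict(lambda: {"cost": 0.0, "requests": 0, "successes": 0})
    for r in records:
        t = totals[r.feature]
        t["cost"] += r.cost
        t["requests"] += 1
        t["successes"] += int(r.success)
    return {
        feature: {
            "cost_per_request": t["cost"] / t["requests"],
            "cost_per_successful_task":
                t["cost"] / t["successes"] if t["successes"] else float("inf"),
        }
        for feature, t in totals.items()
    }

records = [
    UsageRecord("chat", "mid-tier", 5200, 1500, 400, success=True),
    UsageRecord("chat", "mid-tier", 5200, 1500, 400, success=False),
]
print(cost_by_feature(records))
```

With one failure in two requests, cost per successful task is double the cost per request. That gap is exactly what a per-request average hides.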

WAR STORY

A team I worked with was convinced their cost problem was the model — they were mid-migration to a cheaper provider to cut the bill. We added per-request token logging with a feature tag before the migration shipped. The data killed the project. 70% of spend came from a single internal feature that pre-fetched the entire knowledge base into context "to be safe," stuffing 40K input tokens into every request to answer questions that needed maybe 2K. Nobody had looked because the dashboard showed requests, and the request count for that feature was tiny — a few thousand a day. But each request was 20x the token weight of a normal one. We capped retrieval to top-4 and turned on prompt caching for the static instructions. The bill dropped 60% in a week. The cheaper model would have saved 20% and taken a quarter. The lesson: measure tokens by feature before you touch the model. The expensive thing is almost never the thing you assumed.

Common mistakes

  • Counting requests, not tokens. Request volume can be flat while cost triples. If your dashboard shows requests and not input/output tokens, you are flying blind on the one number that is the bill.
  • Treating input and output as one number. Output is 3-5x the price. A model that reads 6,000 tokens and writes 100 is a completely different cost profile from one that reads 1,000 and writes 1,000. Lumping them hides the lever.
  • Never setting max_tokens. The output-length tail is real money, and the default ceiling is generous.
  • Unbounded conversation history. Quadratic growth per conversation. This is the "one user cost us $400" bug, every time.
  • top_k set by vibes. The single most common unexamined cost driver in RAG. More chunks is more cost and frequently worse answers.
  • Busting the prompt cache with a moving prefix. A timestamp or request ID near the front of the prompt silently disables caching for everything after it. Static first, dynamic last.
  • Reaching for model routing first. It is the highest-effort, highest-risk lever. Caching and output control are free wins. Do the easy 50% before the hard 30%.
  • No cost attribution. If you cannot tell which feature or tenant drives spend, you cannot fix it, and you cannot tell the team that shipped the regression.

The mental model

LLM cost engineering is the same discipline as every other infrastructure cost problem, with one unit swapped in. In databases you optimize queries and I/O. In networking you optimize bytes on the wire. In LLM systems you optimize tokens — and the discipline is identical: find the unit of cost, measure it relentlessly, attribute it to the feature that drives it, and pull the cheap levers before the expensive ones.

The trap specific to LLMs is that the per-request cost is always below the threshold where a human notices, and the monthly cost is always above the threshold where finance does. That gap is bridged by one habit: count tokens per request, split input from output, tag by feature, and review it like you review latency. Do that and the CFO's email answers itself before it gets sent. Skip it and you will keep discovering, one quarter at a time, that a feature you forgot about quietly tripled the size of every prompt.

Start with prompt caching. Then cap and shorten output. Then prune context. Then, if you self-host, saturate the GPU. Then, last, route between models. In that order, most teams cut their bill by half or more without touching the thing they assumed was the problem.


The full token cost model — pricing structure, prompt caching, context budgeting, and cost-vs-quality trade-offs at interview depth — is the Cost Engineering lesson in the LLM Operations course. The fundamentals underneath it are in Tokens and The Context Window. For the self-hosted side of the curve, The True Cost of a GPU in GPU Cost Optimization. Related reading: the KV cache wall for cluster-wide prefix caching (the self-hosted version of the prompt-cache discount), Tuning vLLM gpu_memory_utilization and GPU Memory Fragmentation for the throughput levers that set your tokens-per-dollar, vLLM vs SGLang for Production in 2026 for the engine the routing layer sits on, MIG vs Time-Slicing on Kubernetes for partitioning the GPUs underneath all of it, and Your GPU Dashboard Says 100% Utilized. It's Lying. for the DCGM metrics that expose every idle tensor core you are paying for.
