LLM Operations for MLOps Engineers

Tokens: The Unit of Everything

Your team is being charged $0.03 per 1,000 input tokens by an API provider. Marketing wants you to estimate the cost per support conversation, but you have no way to predict how many tokens a 2-page document will turn into. What is a token, and why can't you just count words?

Every number that matters when you operate an LLM is denominated in tokens. The price you pay per request. The context window the model can hold. The throughput you get from a single H100. The latency budget you have to fit under. None of those numbers are denominated in characters or words. They are all tokens.

This lesson is about what a token is, why the count is unpredictable, and what that unpredictability means for the systems you build around it.

What it is

A token is the unit of input and output for an LLM. It is the atomic thing the model sees, processes, and produces. The model's "vocabulary" is a fixed set (typically 32K to 256K) of tokens, and every text you send is broken into a sequence of those tokens before the model can do anything with it.

Tokens are not characters. They are not words. They are subword units, learned from a training corpus to balance vocabulary size against sequence length. In practice:

Common short words ("the", "is", "and") are usually one token each
Common long words ("operating", "infrastructure") are usually one or two tokens
Uncommon words, names, and code identifiers often split into 3+ tokens
A leading space counts as part of the next token in most tokenizers, so " Kubernetes" may be one token while "Kubernetes" without the space might be two
Numbers are often split per-digit ("12345" might be 5 tokens)
Whitespace, indentation, and punctuation each consume tokens

A useful rule of thumb for English text: 1 token ≈ 0.75 words ≈ 4 characters, so a 750-word blog post is about 1,000 tokens. That ratio shifts dramatically for code (more tokens per character because of identifiers and indentation), for non-English text (often 2-3x more tokens), and for structured output like JSON (braces, quotes, and commas all consume tokens, often one each).
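To make the rule of thumb concrete, here is a minimal sketch that turns a word count into a rough token and cost estimate. The function names are illustrative, the 0.75 words-per-token ratio is the English rule of thumb above, and the $0.03 per 1,000 input tokens price is the figure from the opening scenario. Treat the result as a first guess only: it will be off for code, non-English text, and JSON.

```python
WORDS_PER_TOKEN = 0.75          # rule of thumb for English prose only
PRICE_PER_1K_INPUT_USD = 0.03   # example price from the opening scenario


def estimate_tokens(word_count: int) -> int:
    """Rough token estimate for English text from a word count."""
    return round(word_count / WORDS_PER_TOKEN)


def estimate_input_cost_usd(token_count: int) -> float:
    """Input cost at the example per-1,000-token price."""
    return token_count / 1000 * PRICE_PER_1K_INPUT_USD


tokens = estimate_tokens(750)            # the 750-word blog post
cost = estimate_input_cost_usd(tokens)
print(f"{tokens} tokens, about ${cost:.4f} of input")
```

An estimate like this is useful for a first budget conversation, but it should be replaced with a real tokenizer count before anyone commits to a number.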

KEY CONCEPT

The model has no concept of "words." It has a fixed vocabulary of tokens, and every operation it performs is on that token sequence. When you write a system prompt, set a temperature, build a RAG pipeline, or estimate a cost, you are reasoning about tokens. Reasoning about characters or words will mislead you.

How it works under the hood

The journey from a string to something the model can use has three steps. The first two are deterministic; the third is the model.

[Interactive diagram: From string to model output]

The tokenizer is a separate artifact from the model. It is shipped alongside the weights, and you must use the matching tokenizer for the model you are serving. Mismatched tokenizer + model = garbage output, no error, you discover the problem in production.
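The two deterministic steps can be sketched with a toy tokenizer. Everything here is hypothetical: the six-entry vocabulary, the IDs, and the greedy longest-match rule stand in for a real learned tokenizer (BPE, WordPiece, or Unigram), which works differently in detail. The point is only the shape of the pipeline: a string becomes token strings, and token strings become the integer IDs the model consumes.

```python
# Hypothetical toy vocabulary: token string -> token ID.
TOY_VOCAB = {" Kubernetes": 0, "Kube": 1, "rnetes": 2, " is": 3, " fun": 4, "<unk>": 5}


def tokenize(text: str) -> list[str]:
    """Step 1: split text into token strings by greedy longest match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try vocabulary entries from longest to shortest at position i.
        match = None
        for piece in sorted(TOY_VOCAB, key=len, reverse=True):
            if piece != "<unk>" and text.startswith(piece, i):
                match = piece
                break
        if match is None:
            tokens.append("<unk>")  # unknown character: consume one char
            i += 1
        else:
            tokens.append(match)
            i += len(match)
    return tokens


def to_ids(tokens: list[str]) -> list[int]:
    """Step 2: map token strings to integer IDs."""
    return [TOY_VOCAB[t] for t in tokens]


with_space = tokenize(" Kubernetes is fun")
without_space = tokenize("Kubernetes is fun")
print(with_space, to_ids(with_space))
print(without_space, to_ids(without_space))
```

In this toy setup the version with a leading space produces three tokens and the version without it produces four, which mirrors the leading-space behavior described earlier.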

A few worth-knowing tokenizers:

tiktoken: OpenAI's tokenizer. The cl100k_base variant powers GPT-4. Fast Rust implementation with Python bindings.
SentencePiece: Google's tokenizer library. Used by Llama 1 and 2, Mistral, Gemma, and many others (Llama 3 moved to a tiktoken-style BPE tokenizer). Language-agnostic by design.
HuggingFace tokenizers: a Rust library that implements BPE, WordPiece, and Unigram. The Python binding is what you use to count tokens for almost any open model.

Token IDs are integers, but the integers are not portable. Token ID 1234 in Llama's tokenizer is a different string than token ID 1234 in GPT-4's tokenizer. Caches, logs, and analytics that store token IDs are tied to a specific tokenizer version.
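A small sketch shows why stored token IDs are meaningless without the tokenizer that produced them. The two vocabularies below are invented for illustration and do not correspond to any real model.

```python
# Two hypothetical tokenizers that assign the same IDs to different strings.
VOCAB_A = {0: "Hello", 1: " world", 2: "!"}
VOCAB_B = {0: " error", 1: "Hello", 2: " world"}


def decode(ids: list[int], vocab: dict[int, str]) -> str:
    """Turn token IDs back into text using the given vocabulary."""
    return "".join(vocab[i] for i in ids)


logged_ids = [0, 1, 2]  # IDs written to a log by a service using tokenizer A

print(repr(decode(logged_ids, VOCAB_A)))  # decoded with the matching tokenizer
print(repr(decode(logged_ids, VOCAB_B)))  # wrong tokenizer: no error, wrong text
```

One practical consequence is to store a tokenizer name and version next to any persisted token IDs.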

Operationalizing it

Three operational patterns recur every time you put tokens at the center of your platform:

1. Count tokens before you call the model. Most production failures involving "context too long" or "cost spike" trace back to a missing pre-flight token count. Build it into your client wrapper:

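from transformers import AutoTokenizer

# The tokenizer must match the model being served.
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")

TOKEN_BUDGET = 7800  # leave headroom under the 8K (8,192-token) context

def safe_call(prompt: str, max_new_tokens: int):
    # Count input tokens before the request leaves the client.
    input_tokens = len(tok.encode(prompt))
    if input_tokens + max_new_tokens > TOKEN_BUDGET:
        raise ValueError(
            f"Request too long: {input_tokens} input + {max_new_tokens} output "
            f"tokens exceeds the budget of {TOKEN_BUDGET}"
        )
    # ... actually call the model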

2. Bill per token internally. If you have multiple teams sharing an LLM platform, attribute usage at the token level, not the request level. A single 50K-token request costs the same as a thousand 50-token requests. Per-request billing hides the heavy users.
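As a rough illustration of token-level attribution, the sketch below aggregates hypothetical usage records per team. The record shape and team names are assumptions, and it lumps input and output tokens together, which a real chargeback would likely price separately.

```python
from collections import defaultdict


def summarize_usage(records):
    """Aggregate request counts and token totals per team.

    Each record is a hypothetical (team, input_tokens, output_tokens) tuple.
    """
    totals = defaultdict(lambda: {"requests": 0, "tokens": 0})
    for team, input_tokens, output_tokens in records:
        totals[team]["requests"] += 1
        totals[team]["tokens"] += input_tokens + output_tokens
    return dict(totals)


records = [("search", 50_000, 0)]        # one very large request
records += [("chat", 50, 0)] * 1000      # a thousand small requests

for team, usage in summarize_usage(records).items():
    print(team, usage)
```

Both teams end up with the same token total, while a per-request view would make the second team look a thousand times heavier.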

3. Cache by token-ID hash, not string. If you cache responses, key them by the hash of the token ID sequence, not the raw input string. Two strings that tokenize to the same sequence (a rare but real case for whitespace variants) should hit the same cache entry; one string that tokenizes differently under two model versions should not. Include the tokenizer version in the key as well, because identical IDs mean different text under different tokenizers.
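One way such a cache key might look is sketched below. The function name and the tokenizer_version parameter are assumptions for illustration; mixing the version into the hash is a design choice that follows from token IDs not being portable across tokenizers.

```python
import hashlib


def cache_key(token_ids: list[int], tokenizer_version: str) -> str:
    """Hash the token ID sequence together with the tokenizer version."""
    h = hashlib.sha256()
    h.update(tokenizer_version.encode("utf-8"))
    h.update(b"\x00")  # separator between the version and the IDs
    for token_id in token_ids:
        # Fixed-width encoding so [1, 23] and [12, 3] cannot collide.
        h.update(token_id.to_bytes(4, "big"))
    return h.hexdigest()


key_v1 = cache_key([101, 2023, 2003], "toy-tokenizer-v1")
key_v2 = cache_key([101, 2023, 2003], "toy-tokenizer-v2")
print(key_v1 == key_v2)  # same IDs, different tokenizer version
```

The four-byte encoding assumes non-negative token IDs below 2^32, which covers the vocabulary sizes mentioned earlier.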

WAR STORY

A team built a customer support bot that summarized chat history into the prompt on every request. They estimated 200 tokens per summary based on testing. In production, the summarizer kept inserting verbatim quotes from the customer's messages, and one customer's habit of sending stack traces blew the prompt to 8,000 tokens per request. Cost spike was 40x. The fix was a token cap on the summarizer's output. The deeper lesson: token counts are a property of the input distribution, not your average case. Plan for the long tail.
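The gap between the average case and the long tail is easy to see with a few lines of arithmetic. The sample below is invented (98 typical prompts and 2 stack-trace outliers), and the nearest-rank percentile helper is one simple definition among several.

```python
def percentile(values: list[int], pct: int) -> int:
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceiling division
    return ordered[rank - 1]


# Hypothetical prompt sizes in tokens.
prompt_tokens = [200] * 98 + [8000] * 2

mean = sum(prompt_tokens) / len(prompt_tokens)
print("mean:", mean)
print("p50:", percentile(prompt_tokens, 50))
print("p99:", percentile(prompt_tokens, 99))
```

The median still looks like the test estimate, while the 99th percentile is where the cost and context-length failures live. Tracking a high percentile of prompt size per feature is one way to catch this before the bill does.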

Trade-offs and decision framework

The main lever you have at the operational level is prompt budget. For any feature you build:

Decide the maximum input tokens you will allow per request, with a hard cap enforced by the client
Decide the maximum output tokens by setting max_tokens (or equivalent) in your call
The sum of these, plus a 5-10% safety margin, must be less than the model's context window
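The budget rule above reduces to a small check. This is a sketch under assumptions: the function name is hypothetical, the margin is expressed as an integer percentage, and the 8,192-token window is just an example value.

```python
def fits_context(max_input: int, max_output: int, context_window: int,
                 margin_pct: int = 10) -> bool:
    """True if input cap + output cap + safety margin fits in the context window."""
    budget = max_input + max_output
    margin = (budget * margin_pct + 99) // 100  # round the margin up
    return budget + margin <= context_window


print(fits_context(6000, 1000, 8192))  # 7000 + 700 margin = 7700
print(fits_context(7000, 1000, 8192))  # 8000 + 800 margin = 8800
```

Running a check like this at configuration time, not only at request time, catches a feature whose caps can never fit the model it targets.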

The headroom matters because some tokenizers have edge cases that produce more tokens than your pre-flight count. SentencePiece handling of trailing whitespace, BPE handling of emoji, and any model with a chat template (which adds role-marker tokens around every user and assistant turn) can all add 5-20 tokens you did not count.

For cost-sensitive features, you will spend a meaningful amount of engineering effort on prompt compression: summarizing context, dropping less-relevant retrieved chunks, using cheaper models to pre-process before the expensive model runs. Every one of those optimizations is denominated in tokens saved.
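Dropping less-relevant retrieved chunks can be sketched as a greedy selection under a token budget. The chunk tuples, scores, and token counts below are made up, and the sketch assumes token counts were computed beforehand with the tokenizer that matches the target model.

```python
def select_chunks(chunks: list[tuple[str, float, int]], token_budget: int) -> list[str]:
    """Keep the highest-scoring chunks whose token counts fit in the budget.

    Each chunk is (text, relevance_score, token_count).
    """
    selected = []
    used = 0
    for text, _score, token_count in sorted(chunks, key=lambda c: c[1], reverse=True):
        if used + token_count <= token_budget:
            selected.append(text)
            used += token_count
    return selected


chunks = [
    ("refund policy", 0.92, 300),
    ("shipping times", 0.40, 500),
    ("warranty terms", 0.85, 400),
    ("company history", 0.10, 200),
]
print(select_chunks(chunks, token_budget=800))
```

Greedy selection by score is not optimal in general (it is a knapsack problem), but it is simple and predictable, which often matters more in a request path.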

Common mistakes

Estimating cost from word count. English averages ~1.3 tokens per word, but it is not constant. Code is closer to 2-3 tokens per word. Korean and Japanese can be 3-5 tokens per word. Always count actual tokens.
Forgetting the chat template overhead. A "100-token user message" with a chat template can be 130 tokens. Multiply across millions of requests and that's real money.
Mixing tokenizer versions. If you count tokens with one model's tokenizer and then send the request to a different model, your safety margin is fictional.
Truncating in the middle of a token. When a model outputs character-by-character (some streaming APIs do this), naïvely truncating to N bytes can cut a multi-byte UTF-8 character, and truncating to N characters can cut a token midway; both produce broken output. Truncate at token boundaries.
Treating the context window as usable space. A 128K context window does not mean 128K usable tokens. Throughput drops as context grows because of KV cache pressure (covered in Context Window: Managing the Bottleneck). Real usable context is often 30-50% of the advertised window before throughput collapses.
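The truncation mistake is easy to reproduce. The sketch below cuts a UTF-8 string at a byte offset and then shows truncation on whole tokens instead. The token split is hypothetical; with a real tokenizer you would truncate the ID list and decode it with that tokenizer.

```python
def truncate_bytes_naive(text: str, max_bytes: int) -> bytes:
    """Cut the UTF-8 encoding at a byte offset; may split a character."""
    return text.encode("utf-8")[:max_bytes]


def truncate_tokens(tokens: list[str], max_tokens: int) -> str:
    """Keep whole tokens only, then join them back into text."""
    return "".join(tokens[:max_tokens])


text = "naïve café"
cut = truncate_bytes_naive(text, 3)  # "ï" is 2 bytes, so byte 3 lands inside it
try:
    cut.decode("utf-8")
except UnicodeDecodeError as exc:
    print("broken output:", exc.reason)

# Hypothetical token split of the same text.
tokens = ["na", "ï", "ve", " caf", "é"]
print(truncate_tokens(tokens, 2))
```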

INTERVIEW QUESTION

How would you measure tokens-per-second served by a single GPU for a 13B model, and how would that change if you switched from FP16 to INT8?
