What Is an LLM, Really?
Your team is evaluating whether to self-host a 70B-parameter model or use an API. Before you make a recommendation, you need to understand what an LLM actually is: not the marketing pitch, but the architecture. What's inside this thing, and why does it need 8 GPUs?
If you have ever shipped a Kubernetes service, an LLM is going to feel different in one specific way: it is the only "service" you will ever run where the binary is the data. There is no Go program here. There is a 140 GB pile of floating-point numbers and a 200-line file of attention math that knows how to use them. That distinction shapes everything about how you operate it.
This lesson is a tour of what is actually inside an LLM, framed for someone who has to put it in production rather than train it.
What it is
An LLM has three parts: an embedding layer at the bottom that turns tokens into vectors, a stack of identical transformer blocks (attention + feed-forward) in the middle that does all the work, and an output projection at the top that turns the final vector back into "which token comes next."
When someone says "Llama-3 70B," the 70B is the count of learnable parameters across all of those layers. Each parameter is a float. At FP16 that is 2 bytes, so 70B parameters is 140 GB just for the weights. That is the number the rest of your infrastructure has to tolerate.
Three pieces of vocabulary that are worth getting right because every other lesson uses them:
- Parameter: a single learned number. A weight in a matrix or a bias in a layer norm.
- Layer (also called a "block"): one transformer block, which has self-attention and a feed-forward network inside it. Llama-3 70B has 80 layers.
- Forward pass: feeding tokens through the model once to produce a single next-token probability distribution. This is what you do at inference. Training does this plus a backward pass. You will only ever do forward passes.
A forward pass is the thing your serving stack actually runs. It takes a batch of token sequences as input and produces, for each sequence, a probability distribution over the entire vocabulary for the next token. Then your sampler picks one. That is inference. Everything else is plumbing.
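To make the "forward pass, then sampler" split concrete, here is a minimal sketch of the sampling step alone. It assumes the forward pass has already produced one raw score (logit) per vocabulary token; the function names and the toy 5-token vocabulary are illustrative, not taken from any serving framework.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores (logits) into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_next_token(logits):
    """Greedy decoding: always take the highest-scoring token id."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_next_token(logits, temperature=1.0, rng=random):
    """Draw one token id at random, weighted by its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy vocabulary of 5 tokens; a real model emits one logit per vocabulary entry.
toy_logits = [2.0, 0.5, -1.0, 3.0, 0.0]
probs = softmax(toy_logits)
greedy_id = greedy_next_token(toy_logits)
sampled_id = sample_next_token(toy_logits, temperature=0.8, rng=random.Random(0))
```

Everything expensive happens before `toy_logits` exists; the sampler itself is a few lines of arithmetic on one vector per sequence.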
How it works under the hood
A model is just a function: tokens go in, a vocab-sized vector of probabilities comes out. The interesting part is what happens in between. For Llama-3 70B, the same shape of block is applied 80 times, with different learned weights at each layer.
What's inside an LLM, top to bottom
- Output projection: a single linear layer that projects the final hidden vector into a vector of vocabulary size. Then softmax turns it into a probability distribution. This is the only place the model commits to a token.
- Transformer block N (top of the stack): self-attention + feed-forward network + two layer norms + two residual connections. Identical in shape to every other block, with its own learned weights.
- Blocks in between: Llama-3 70B has 80 layers. Llama-3 8B has 32. The number of layers and the hidden dimension are the two knobs that drive parameter count.
- Transformer block 1 (bottom of the stack): same architecture as layer N. The difference is what it learned: early layers tend to handle syntax and surface structure, later layers handle abstract concepts.
- Embedding layer: a lookup table mapping each of the ~128K vocabulary tokens to a vector. For Llama-3 70B, with hidden dim 8192, this is a 128K x 8192 matrix. About 1B parameters, all by itself.
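The layer tour above can be turned into a rough parameter count. The sketch below is an estimate under stated assumptions, not an official breakdown: the 64 attention heads, 8 KV heads, FFN width of 28672, and exact vocabulary size of 128256 are values assumed from the publicly released Llama-3 70B configuration rather than from this lesson, and the input and output embeddings are assumed to be separate (untied) matrices.

```python
from dataclasses import dataclass


@dataclass
class ModelShape:
    vocab_size: int
    hidden_dim: int
    num_layers: int
    num_heads: int
    num_kv_heads: int
    ffn_dim: int


def count_params(m: ModelShape) -> dict:
    head_dim = m.hidden_dim // m.num_heads
    kv_dim = m.num_kv_heads * head_dim
    # Attention: Q and output projections are hidden x hidden;
    # K and V are narrower when grouped-query attention is used.
    attn = 2 * m.hidden_dim * m.hidden_dim + 2 * m.hidden_dim * kv_dim
    # Gated FFN (assumed): gate, up, and down projections.
    ffn = 3 * m.hidden_dim * m.ffn_dim
    norms = 2 * m.hidden_dim  # two norm weight vectors per block
    per_layer = attn + ffn + norms
    embedding = m.vocab_size * m.hidden_dim
    output = m.vocab_size * m.hidden_dim  # assumes untied embeddings
    final_norm = m.hidden_dim
    total = embedding + m.num_layers * per_layer + final_norm + output
    return {
        "embedding": embedding,
        "per_layer": per_layer,
        "total": total,
        "ffn_share": m.num_layers * ffn / total,
    }


# Assumed Llama-3 70B shape (see the caveats in the paragraph above).
LLAMA3_70B = ModelShape(
    vocab_size=128256, hidden_dim=8192, num_layers=80,
    num_heads=64, num_kv_heads=8, ffn_dim=28672,
)
counts = count_params(LLAMA3_70B)
print(f"total params: {counts['total'] / 1e9:.1f}B")
print(f"FFN share:    {counts['ffn_share']:.0%}")
```

Changing `num_layers` or `hidden_dim` moves the total far more than any other field, which is the "two knobs" point made above.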
The transformer block is the one piece worth understanding in detail because it is where your GPU spends almost all of its time. Inside each block:
- Self-attention lets each token look at every earlier token in the sequence. Its compute scales as O(seq_len^2), which is why long context windows are expensive. Modern models trim the cost with techniques like Grouped-Query Attention (Llama-3), which shrinks the KV cache, or sliding window attention (Mistral), which limits how far back each token looks, but the fundamental shape is the same.
- Feed-forward network (FFN) is, in the classic design, two linear layers with an activation in between. Sounds boring. It holds roughly two-thirds of the model's parameters, sometimes more. When you read "MoE" (Mixture of Experts), this is the layer that gets sparsified.
- Layer norms and residual connections wrap around both, providing training stability and a gradient highway. They are cheap and you mostly ignore them at the operational level.
The attention math is softmax(QK^T / sqrt(d_k)) V, but the operationally important property is: it produces, for each token in the sequence, a context-aware representation that mixes information from every previous token. That is how the model "reads" the prompt.
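For readers who want to see the formula run, here is a deliberately slow, pure-Python sketch of single-head causal attention. It is a teaching aid under simplifying assumptions (one head, no batching, no learned projections; Q, K, and V are passed in directly), not how any production kernel is written.

```python
import math


def causal_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V with a causal mask, for one head.

    Q, K, V are lists of per-token vectors (lists of floats).
    """
    d_k = len(K[0])
    outputs = []
    for i, q in enumerate(Q):
        # Token i may only attend to tokens 0..i (itself and earlier ones).
        scores = [
            sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        # Context-aware output: weighted mix of the visible value vectors.
        out = [
            sum(w * V[j][c] for j, w in enumerate(weights))
            for c in range(len(V[0]))
        ]
        outputs.append(out)
    return outputs


# Three toy tokens with 2-dimensional vectors.
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = causal_attention(toy, toy, toy)
```

The loop over `j` inside the loop over `i` is the O(seq_len^2) cost in its plainest form: doubling the sequence length roughly quadruples the number of score computations.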
Operationalizing it
The shape of the model dictates the shape of your infrastructure. Three things matter most:
Memory for weights. Multiply parameter count by bytes-per-param. FP32 = 4 bytes, FP16/BF16 = 2 bytes, INT8 = 1 byte, INT4 = 0.5 bytes (in theory; once you add the per-group scales that quantized formats store alongside the weights, more like 0.6).
| Model | Params | FP16 weights | INT8 weights | INT4 weights |
|---|---|---|---|---|
| Llama-3 8B | 8B | 16 GB | 8 GB | ~4.5 GB |
| Llama-3 70B | 70B | 140 GB | 70 GB | ~40 GB |
| Llama-3 405B | 405B | 810 GB | 405 GB | ~230 GB |
An H100 has 80 GB of HBM. A 70B in FP16 needs 2 H100s minimum just to fit, before you have allocated a single byte of KV cache or activations. That is why "70B model" usually means "8x H100 node" once you account for headroom and tensor parallelism overhead.
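The weight-memory arithmetic is simple enough to script. This sketch reproduces the FP16 and INT8 columns of the table and the "2 H100s minimum" figure; it uses the theoretical bytes-per-param values and decimal gigabytes, and the helper names are made up for illustration.

```python
import math

# Theoretical bytes per parameter; real INT4 formats carry extra overhead.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion, dtype):
    """Weights only, in decimal GB: parameter count times bytes-per-param."""
    return params_billion * BYTES_PER_PARAM[dtype]


def min_gpus_for_weights(params_billion, dtype, gpu_memory_gb=80.0):
    """Smallest GPU count whose combined memory holds the weights alone."""
    return math.ceil(weight_memory_gb(params_billion, dtype) / gpu_memory_gb)


for name, size in [("Llama-3 8B", 8), ("Llama-3 70B", 70), ("Llama-3 405B", 405)]:
    fp16 = weight_memory_gb(size, "fp16")
    gpus = min_gpus_for_weights(size, "fp16")
    print(f"{name}: {fp16:.0f} GB in FP16, at least {gpus} x 80 GB GPUs")
```

The result is a floor, not a sizing: it leaves nothing for KV cache or activations.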
Memory for KV cache. When the model generates a token, the keys and values already computed for earlier tokens, at every attention layer, are cached so subsequent tokens do not recompute them. KV cache size grows with sequence length and concurrency, and it can easily be larger than the weights at high concurrency. We cover this in detail in the vLLM tuning post, but the rough formula is:
kv_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
total_kv = kv_per_token * total_active_tokens
For Llama-3 70B (80 layers, 8 KV heads, head dim 128, FP16), that is roughly 320 KB per token. At 4K context and 32 concurrent sequences, you are at ~40 GB of KV cache. A 2-GPU FP16 deployment has 160 GB of HBM and 140 GB of weights, which leaves ~20 GB for KV cache. The math does not work: you OOM.
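Here is the same formula as a small script, so the 320 KB and ~40 GB figures can be checked and re-run for other shapes. The 8 KV heads and head dimension of 128 are assumed from the published Llama-3 70B configuration; the function names are illustrative.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache one token occupies across all layers.

    The leading 2 accounts for storing both a key and a value vector.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def total_kv_bytes(per_token_bytes, context_len, concurrent_seqs):
    """Worst case: every concurrent sequence is at full context length."""
    return per_token_bytes * context_len * concurrent_seqs


# Assumed Llama-3 70B shape at FP16 (2 bytes per value).
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8,
                               head_dim=128, dtype_bytes=2)
total = total_kv_bytes(per_token, context_len=4096, concurrent_seqs=32)

print(f"per token: {per_token / 1024:.0f} KiB")
print(f"total:     {total / 2**30:.0f} GiB")
```

Treating every sequence as full-length is pessimistic, but it is the number to provision for if you want to avoid OOMs under load.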
Tensor parallelism. When weights do not fit on one GPU, you split them across multiple GPUs and the framework (vLLM, TGI, TensorRT-LLM) handles the all-reduce communication during the forward pass. NVLink and InfiniBand exist because this all-reduce is bandwidth-hungry. If your GPUs can only talk to each other over a slow network link, tensor-parallel throughput collapses.
A team wanted to "save money" by running Llama-3 70B on 8x A10G GPUs (24 GB each, total 192 GB) over a 25 Gbps network. Math says it fits. Real-world throughput was 1/8 of what they expected. The all-reduce was saturating the network for every single token. They were paying for 8 GPUs and getting 1 GPU's worth of useful work. Move to H100s with NVLink, problem solved, but the sticker price tripled. Lesson: parallelism math is necessary but not sufficient. You also need the right interconnect.
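A back-of-the-envelope estimate shows why the interconnect dominates in a setup like this. The sketch below is a bandwidth-only lower bound under several assumptions that are not from this lesson: two all-reduces per transformer block per forward pass, a ring all-reduce that moves 2 * (n - 1) / n times the payload per GPU, and no latency, overlap, or protocol overhead. Treat the output as an order-of-magnitude guide, not a prediction.

```python
def allreduce_bytes_per_token(num_layers, hidden_dim, dtype_bytes, tp_degree):
    """Bytes each GPU sends to all-reduce one token's activations.

    Assumes two all-reduces per block (after attention and after the FFN)
    and a ring all-reduce across tp_degree GPUs.
    """
    payload = hidden_dim * dtype_bytes            # one token's hidden vector
    allreduces = 2 * num_layers                   # assumed layout
    ring_factor = 2 * (tp_degree - 1) / tp_degree
    return allreduces * payload * ring_factor


def comm_seconds(tokens, bytes_per_token, link_gbps):
    """Time to move that traffic over a link of link_gbps gigabits/second."""
    link_bytes_per_second = link_gbps * 1e9 / 8
    return tokens * bytes_per_token / link_bytes_per_second


# Llama-3 70B (80 layers, hidden dim 8192) at FP16, split across 8 GPUs.
per_token = allreduce_bytes_per_token(80, 8192, 2, tp_degree=8)
# Communication time for a single 4096-token prompt over a 25 Gbps link.
prefill_seconds = comm_seconds(4096, per_token, link_gbps=25)

print(f"{per_token / 1e6:.1f} MB of all-reduce traffic per token per GPU")
print(f"{prefill_seconds:.1f} s of pure communication for a 4096-token prompt")
```

Re-running `comm_seconds` with the bandwidth of your actual interconnect is a quick way to check whether a "fits in memory" plan also fits on the wire.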
Trade-offs and decision framework
The "self-host 70B vs. API" question that opened this lesson reduces to four levers:
- Cost per token at your QPS. Self-hosting wins above some break-even (often around 100M-500M tokens/day, but it depends heavily on your latency budget and which GPUs you can buy). Below that, API is cheaper after you account for engineering time.
- Latency floor. API has network round-trip. Self-hosted on the same VPC can be 30-100 ms faster on time-to-first-token, which matters for chat UX.
- Data egress. Some data legally cannot leave your network. That is a single yes/no answer that overrides everything else.
- Operational depth. Self-hosting means you need to know everything in this course. APIs let you skip 80% of it. Whether that is the right trade depends on your team and what you are building.
The model size you pick is downstream of those four. A team that is API-bound for compliance reasons but only needs an 8B-class model is in a different conversation than a team trying to host a 405B for cost reasons.
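The cost lever can be roughed out in a few lines. The prices below are placeholder inputs chosen only to make the arithmetic visible; they are not quotes from any vendor, and the sketch ignores engineering time, utilization, and whether the node can actually sustain the required throughput.

```python
def self_host_cost_per_day(num_gpus, gpu_cost_per_hour):
    """Daily cost of keeping the GPUs running around the clock."""
    return num_gpus * gpu_cost_per_hour * 24


def break_even_tokens_per_day(num_gpus, gpu_cost_per_hour,
                              api_price_per_million_tokens):
    """Daily token volume at which the API bill equals the GPU bill."""
    daily_cost = self_host_cost_per_day(num_gpus, gpu_cost_per_hour)
    return daily_cost / api_price_per_million_tokens * 1_000_000


# Hypothetical inputs: 8 GPUs at 4.00/hour each, API at 3.00 per 1M tokens.
daily = self_host_cost_per_day(num_gpus=8, gpu_cost_per_hour=4.0)
tokens = break_even_tokens_per_day(num_gpus=8, gpu_cost_per_hour=4.0,
                                   api_price_per_million_tokens=3.0)

print(f"self-hosting costs {daily:.0f} per day")
print(f"break-even at about {tokens / 1e6:.0f}M tokens per day")
```

Plug in your own GPU and API prices; the break-even moves linearly with both, which is why published estimates vary so widely.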
Common mistakes
- Treating "parameter count" as quality. Llama-3 8B outperforms Llama-2 13B on most things. Across generations, training data and architecture can matter more than param count. Within a generation, param count is a much better guide.
- Sizing only for weights. Forgetting KV cache and activations is the #1 source of "we provisioned correctly but it OOMs in production" tickets.
- Assuming inference scales like training. Training is a batch operation; inference is an interactive one. The serving math is dominated by KV cache and continuous batching, not just FLOPS. We cover this in Inference: Serving Predictions at Scale.
- Picking GPU memory based on the marketing chart. "Fits on a single A100 80GB" usually means weights only, in FP16, with no KV cache, at batch 1, at context 1. Real serving needs a 30-50% headroom buffer. Always.
- Confusing "open weights" with "open source". Llama, Mistral, and most "open" models give you the weights under a license but not the training data or the training code. You can serve them. You cannot easily fork them.
How would you estimate the GPU memory needed to serve a 13B parameter model? What about a 70B model?
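One way to start on that question is to apply the two rules of thumb from this lesson: weights are parameter count times bytes-per-param, and real serving wants a 30-50% buffer on top. The sketch below reads the headroom as a multiplier on weight memory, which is one interpretation among several; a fuller answer would add an explicit KV cache term for your context length and concurrency.

```python
def serving_memory_range_gb(params_billion, bytes_per_param=2.0,
                            headroom=(0.30, 0.50)):
    """Return (weights, low, high) in decimal GB.

    low and high apply the lower and upper headroom fractions to the weights.
    """
    weights = params_billion * bytes_per_param
    low = weights * (1 + headroom[0])
    high = weights * (1 + headroom[1])
    return weights, low, high


for size in (13, 70):
    weights, low, high = serving_memory_range_gb(size)
    print(f"{size}B at FP16: {weights:.0f} GB weights, "
          f"plan for roughly {low:.0f}-{high:.0f} GB")
```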