Latent Space and Parameters
A vendor pitches a 175B parameter model. Your existing infra runs 7B models. Translating "175B parameters" into "what infrastructure do I need" requires knowing what a parameter is, where it lives in memory, and how that scales across tensor parallelism, pipeline parallelism, and quantization.
When people ask "what does the model know?" they are pointing at something that is genuinely hard to describe. The model's "knowledge" is not stored in any file you can grep. It is not a database row. For a 70B model in FP16, it is a 140 GB pile of floating-point numbers, distributed across hundreds of weight matrices, and the only way to query it is to run a forward pass.
This lesson is about what those parameters actually are, what "latent space" means in practice, and how to translate the number on a model spec sheet ("70B") into the GPU bill it produces.
What it is
A parameter is a single learned floating-point number inside the model. Every weight in every linear layer, every entry in every embedding table, every bias. Each one was set during training and is now frozen for inference.
A model with N parameters has N numbers. Llama-3 70B has roughly 70 billion of them, distributed roughly like this:
| Component | Parameters | Notes |
|---|---|---|
| Embedding table | ~1B | 128K vocab × 8192 hidden_dim |
| 80 transformer layers | ~68B | Most of the model. Each layer has ~855M params. |
| LM head (output projection) | ~1B | Tied to the embedding table in some models to save space; kept separate in Llama-3 70B |
| Layer norms | ~1M | Negligible. Llama-style layers use RMSNorm gains and no bias terms |
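
The breakdown above can be reproduced from the model's shape hyperparameters. The sketch below is a rough count, not an official accounting: the `DecoderConfig` class and `count_parameters` function are hypothetical names, and the default values (MLP width of 28,672, 64 query heads, an untied LM head, no bias terms) are assumptions taken from the published Llama-3 70B configuration rather than from this lesson.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderConfig:
    # Assumed values, taken from the published Llama-3 70B configuration.
    vocab_size: int = 128_256
    hidden_dim: int = 8_192
    mlp_dim: int = 28_672      # width of the gated feed-forward block
    num_layers: int = 80
    num_heads: int = 64        # query heads
    num_kv_heads: int = 8      # key/value heads (grouped-query attention)


def count_parameters(cfg: DecoderConfig) -> dict[str, int]:
    head_dim = cfg.hidden_dim // cfg.num_heads
    kv_dim = cfg.num_kv_heads * head_dim
    # Attention: Q and O projections are hidden x hidden, K and V are hidden x kv_dim.
    attention = 2 * cfg.hidden_dim * cfg.hidden_dim + 2 * cfg.hidden_dim * kv_dim
    # Gated MLP: gate, up, and down projections.
    mlp = 3 * cfg.hidden_dim * cfg.mlp_dim
    # Two RMSNorm gain vectors per layer; no bias terms are assumed.
    norms = 2 * cfg.hidden_dim
    per_layer = attention + mlp + norms
    embedding = cfg.vocab_size * cfg.hidden_dim
    counts = {
        "embedding": embedding,
        "per_layer": per_layer,
        "all_layers": per_layer * cfg.num_layers,
        "lm_head": embedding,          # assumed untied from the embedding table
        "final_norm": cfg.hidden_dim,
    }
    counts["total"] = (
        counts["embedding"] + counts["all_layers"]
        + counts["lm_head"] + counts["final_norm"]
    )
    return counts


counts = count_parameters(DecoderConfig())
for name, value in counts.items():
    print(f"{name:>12}: {value / 1e9:8.3f} B")
```

Under these assumptions the total comes out at roughly 70.6 billion parameters, with the transformer layers accounting for almost all of it.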
The numbers in those matrices are not interpretable individually. No single parameter "represents" a fact. Knowledge is encoded in the patterns of activations they produce when you run text through them. The phrase "latent space" refers to the high-dimensional vector space those activations live in. Each token, at each layer, is represented by a vector in that space; the model's "thinking" is the trajectory of those vectors through the layers.
You will hear "the model lives in latent space" a lot. What it actually means: the model's intermediate representations are vectors of high dimension (the hidden dimension), and similar concepts produce nearby vectors at the deeper layers. It is the same intuition as embeddings, but applied internally and at every layer.
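As a toy illustration of "nearby vectors", the sketch below compares made-up four-dimensional hidden states with cosine similarity. The vectors are invented for the example and do not come from any model; real hidden states have thousands of dimensions and would be read out of a forward pass.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical hidden states at one deep layer (illustrative values only).
hidden = {
    "Paris": [0.9, 0.8, 0.1, 0.0],
    "Lyon": [0.8, 0.9, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9, 0.8],
}

city_pair = cosine_similarity(hidden["Paris"], hidden["Lyon"])
odd_pair = cosine_similarity(hidden["Paris"], hidden["banana"])
print(f"Paris vs Lyon:   {city_pair:.3f}")
print(f"Paris vs banana: {odd_pair:.3f}")
```

The two city vectors point in nearly the same direction while the unrelated one does not, which is the geometric sense in which related concepts are "close" in latent space.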
You cannot reliably edit a fact in a model by changing parameters directly. The fact "the capital of France is Paris" is not stored as a string somewhere; it is encoded across millions of weights in a way that emerges from the forward pass. This is why fine-tuning to "fix" a wrong answer often introduces five new wrong answers. Fixing facts at the parameter level is not surgery; it is bricklaying with a sledgehammer.
How it works under the hood
When the model is loaded, the parameters are partitioned and laid out in GPU memory in a specific way. For a 70B model in FP16 on 8x H100, the layout looks something like this:
Memory layout of a 70B FP16 model on an 8x H100 node:
- CUDA context. ~500 MB to 1 GB per GPU. Variable. Goes up after kernel upgrades. The headroom you forgot to budget for.
- Activations. Intermediate values during each forward pass. Sized by batch_size * seq_len * hidden_dim. Freed after each forward pass but allocated during it.
- KV cache. Per-layer keys and values for every active sequence. Grows with concurrency and context length. Often the biggest variable in your memory budget.
- Weights. Llama-3 70B in FP16 is 140 GB total, split as ~17.5 GB per GPU using tensor parallelism. The all-reduce communication during the forward pass is what eats your interconnect bandwidth.
The two pieces that matter operationally are the weights and the KV cache. Activations are usually small and short-lived. CUDA context is small but real (don't forget to budget for it).
Weights are sharded across GPUs using tensor parallelism. Each GPU holds a slice of every layer's weight matrices. During the forward pass, GPUs perform matrix multiplies on their slices in parallel, then communicate intermediate results via all-reduce over NVLink (within a node) or InfiniBand (across nodes). For an 8-way tensor-parallel setup of a 70B FP16 model, each GPU holds about 17.5 GB of weights.
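To make the all-reduce step concrete, here is a minimal single-process simulation of one row-parallel matrix multiply. It is a sketch under simplifying assumptions: plain Python lists stand in for GPU tensors, the hypothetical `row_parallel_matvec` helper assumes the row count divides evenly across devices, and real frameworks pair this pattern with column-parallel layers and run the reduction over NVLink or InfiniBand.

```python
def matvec(x, w):
    """Multiply a row vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def row_parallel_matvec(x, w, num_devices):
    """Simulate tensor parallelism where each device owns a block of rows of w."""
    rows_per_device = len(w) // num_devices  # assumes an even split
    partials = []
    for rank in range(num_devices):
        start = rank * rows_per_device
        stop = start + rows_per_device
        # Each device multiplies its slice of the input by its slice of the weights.
        partials.append(matvec(x[start:stop], w[start:stop]))
    # All-reduce: element-wise sum of the partial outputs across devices.
    return [sum(values) for values in zip(*partials)]


x = [1, 2, 3, 4]
w = [[1, 0], [0, 1], [2, 2], [1, -1]]
full = matvec(x, w)
sharded = row_parallel_matvec(x, w, num_devices=2)
print(full, sharded)
```

No device ever holds the whole matrix, yet the summed partial results match the unsharded product. That sum is the traffic that crosses the interconnect on every sharded layer.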
KV cache is more interesting because it is the part that actually limits your throughput. For each sequence the model is currently processing, the keys and values from earlier attention computations are cached so they do not have to be recomputed when the next token is generated. The math:
kv_cache_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
For Llama-3 70B with 80 layers, 8 KV heads (after grouped-query attention), 128 head dimension, and FP16:
kv_per_token = 2 * 80 * 8 * 128 * 2 = 327,680 bytes ≈ 320 KB / token
At 4096-token context and 32 concurrent sequences, that is 320 KB × 4096 × 32 ≈ 40 GB of KV cache in binary units (about 43 GB in decimal), summed across the node and in addition to your weights and activations. This is why high-concurrency LLM serving is hard.
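The arithmetic above is easy to wrap in two small functions so it can be rerun for other models and workloads. The function names are hypothetical; the inputs are the Llama-3 70B figures quoted in the text.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    # Factor of 2: one key vector and one value vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_bytes(per_token, context_length, concurrent_sequences):
    # Worst case: every sequence is filled to the full context length.
    return per_token * context_length * concurrent_sequences


per_token = kv_cache_bytes_per_token(
    num_layers=80, num_kv_heads=8, head_dim=128, dtype_bytes=2
)
total = kv_cache_bytes(per_token, context_length=4096, concurrent_sequences=32)

print(f"per token: {per_token:,} bytes ({per_token / 2**10:.0f} KiB)")
print(f"total:     {total / 2**30:.1f} GiB ({total / 1e9:.1f} GB)")
```

Doubling either the context length or the concurrency doubles the total, which is why those two limits are the first knobs to check when the cache does not fit.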
Operationalizing it
The single most useful thing you can carry into a capacity planning conversation is the memory formula:
total_gpu_memory >= weights + kv_cache + activations + headroom
weights = num_params * bytes_per_param
kv_cache = kv_per_token * max_concurrent_sequences * max_context_length
activations = batch_size * seq_len * hidden_dim * 2 bytes (FP16) * (small multiplier, framework-dependent)
headroom = ~10-20% of total
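The formula above translates directly into a small calculator. This is a sketch with stated assumptions: the function name is hypothetical, `activation_multiplier=4` is a placeholder for the framework-dependent factor (measure it for your stack), headroom is read as a fraction of total GPU memory, and each H100 is taken as 80 × 10^9 bytes.

```python
def serving_memory_budget(
    num_params,
    bytes_per_param,
    kv_per_token,
    max_concurrent_sequences,
    max_context_length,
    batch_size,
    seq_len,
    hidden_dim,
    total_gpu_memory,
    activation_multiplier=4,   # hypothetical placeholder, framework-dependent
    headroom_fraction=0.15,    # midpoint of the 10-20% range
):
    weights = num_params * bytes_per_param
    kv_cache = kv_per_token * max_concurrent_sequences * max_context_length
    # 2 bytes per activation value, assuming FP16 activations.
    activations = batch_size * seq_len * hidden_dim * 2 * activation_multiplier
    headroom = headroom_fraction * total_gpu_memory
    required = weights + kv_cache + activations + headroom
    return {
        "weights": weights,
        "kv_cache": kv_cache,
        "activations": activations,
        "headroom": headroom,
        "required": required,
        "fits": required <= total_gpu_memory,
    }


# 70B model in FP16, 4K context, 32 concurrent sequences, 8 GPUs of 80 GB each.
workload = dict(
    num_params=70_000_000_000,
    bytes_per_param=2,
    kv_per_token=327_680,
    max_concurrent_sequences=32,
    max_context_length=4096,
    batch_size=32,
    seq_len=4096,
    hidden_dim=8192,
)
budget = serving_memory_budget(**workload, total_gpu_memory=8 * 80 * 10**9)
for name, value in budget.items():
    print(name, value if name == "fits" else f"{value / 1e9:.1f} GB")
```

Treating the whole node as one memory pool is a simplification; with tensor parallelism the per-GPU share is what has to fit, so divide the sharded terms by the parallel degree before comparing against a single device.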
Plug in concrete numbers for the workload you are sizing. If the answer exceeds your hardware, you have four levers in roughly increasing complexity:
- Quantize the weights. FP16 → INT8 halves weight memory. INT4 quarters it. Usually 1-3% quality drop, varies by model. Almost always worth trying first.
- Reduce concurrency or context. Cuts KV cache linearly. Loses you throughput or capability.
- Add tensor parallelism. Splits weights across more GPUs. Adds interconnect cost.
- Add pipeline parallelism. Splits layers across GPUs (rather than splitting each layer across GPUs). More complex, used for very large models that span nodes.
For most production deployments, the order is: quantize first, then add tensor parallelism, then pick context/concurrency limits, then add pipeline parallelism only if the model genuinely cannot fit otherwise. The full tuning loop is in the vLLM tuning post.
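The "quantize first, then add tensor parallelism" ordering can be sketched as a simple search. Everything here is an illustrative assumption: the `first_fit` helper, the candidate lists, and the 15% reserve are hypothetical, activations are ignored, and the KV cache is assumed to shard evenly across the tensor-parallel group. It checks memory only and says nothing about quality, which still needs an evaluation run at each precision.

```python
PRECISIONS = [("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]  # bytes per parameter
TP_DEGREES = [1, 2, 4, 8]


def first_fit(num_params, kv_cache_bytes, gpu_bytes, reserve_fraction=0.15):
    """Return the first (TP degree, precision) whose per-GPU share fits, else None."""
    usable = gpu_bytes * (1 - reserve_fraction)
    for tp in TP_DEGREES:                      # add GPUs only after...
        for name, bytes_per_param in PRECISIONS:  # ...trying lower precision
            per_gpu = (num_params * bytes_per_param + kv_cache_bytes) / tp
            if per_gpu <= usable:
                return {"tp": tp, "precision": name, "per_gpu_bytes": per_gpu}
    return None


plan = first_fit(
    num_params=70_000_000_000,
    kv_cache_bytes=42_949_672_960,  # 4K context at concurrency 32, from above
    gpu_bytes=80 * 10**9,
)
print(plan)
```

For this workload the search stops at two GPUs with INT8 weights, because a single 80 GB device cannot hold even the INT4 weights plus the KV cache once the reserve is set aside.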
A team tried to squeeze a Llama-3.1 405B onto a single 8x H100 node by quantizing aggressively to INT4. The weights-only math worked: 405B parameters × 0.5 bytes ≈ 203 GB, which fits in the node's 8 × 80 GB = 640 GB. They forgot the activations, KV cache, and CUDA context. First request OOMed. Second request OOMed. They ended up at 16x H100 across two nodes with INT8, with all the cross-node interconnect pain that implies. Lesson: the memory formula has four terms. Forgetting any of them is a CrashLoopBackOff.
Trade-offs and decision framework
The decision you usually have to defend is "how many GPUs do I need to serve this model?" The answer is a function of:
| Variable | Effect on memory | Effect on quality | Effect on latency |
|---|---|---|---|
| Quantize FP16 → INT8 | -50% weights | -1 to -3% on most benchmarks | Roughly neutral; may be faster on some hardware |
| Quantize INT8 → INT4 | -50% weights again | -3 to -8% on most benchmarks | Often slower per token (some quant kernels are not optimized) |
| Add tensor parallelism | -50% weights per GPU for each doubling of the TP degree | None | Slower per token (interconnect overhead) |
| Reduce max context | -X% KV cache | None | None |
| Reduce max concurrency | -X% KV cache | None | Lower throughput |
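
The quantization rows in the table trade memory for rounding error. The sketch below shows where that error comes from using symmetric per-tensor absmax INT8 quantization, one of the simplest schemes. It is not necessarily what any particular serving stack uses (production methods typically use per-channel or per-group scales), and the weight values are made up.

```python
def quantize_absmax_int8(weights):
    """Map floats to integers in [-127, 127] using one scale for the whole tensor."""
    scale = max(abs(w) for w in weights) / 127  # assumes at least one nonzero weight
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    return [q * scale for q in quantized]


weights = [0.42, -1.30, 0.07, 0.88, -0.005]  # illustrative values only
quantized, scale = quantize_absmax_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(quantized)
print(f"scale={scale:.5f}, max round-trip error={max_error:.5f}")
```

Each weight can be off by up to half a quantization step, and small weights sitting next to one large outlier can collapse to zero. That per-weight error is the mechanism behind the benchmark deltas in the table.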
The honest answer is "run the math, then run a benchmark." The math gives you the lower bound on hardware. The benchmark tells you what your latency and throughput actually are at that hardware level. You almost always need both numbers before you can defend a capacity plan.
Common mistakes
- Quoting param count to non-engineers as if it determines quality. Modern models with strong training data can dramatically outperform older models with more parameters. Mistral 7B outperformed the larger Llama 2 13B, released only a few months earlier, on the benchmarks its authors reported. Param count is not quality.
- Sizing GPU memory for weights only. Repeating from earlier lessons because it is the #1 production sizing error. Always include KV cache and activations.
- Assuming quantization is free. Quality drops are usually small but always non-zero, and they vary by task. Run your real evaluation suite at each precision before committing.
- Confusing pipeline and tensor parallelism. Tensor parallelism splits each layer across GPUs (lots of communication, low latency overhead). Pipeline parallelism splits sequential layers across GPUs (less communication, but introduces a "pipeline bubble" that hurts latency). They solve different problems.
- Treating parameter editing as fine-tuning. "Just patch the weights to fix this output" is a research problem (model editing, ROME, MEMIT). It is not a production technique. Fine-tuning is the closest production-ready cousin and even it works at the distribution level, not the fact level.
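The "pipeline bubble" mentioned in the list can be estimated with a standard idealized formula: for a simple fill-and-drain schedule with p stages and m micro-batches, the idle fraction is (p - 1) / (m + p - 1). The sketch assumes every stage takes the same time and ignores communication, so treat the result as a floor rather than a prediction; the function name is hypothetical.

```python
def pipeline_bubble_fraction(num_stages, num_microbatches):
    """Idle fraction of a fill-and-drain pipeline with equal-cost stages."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


for microbatches in (1, 4, 13):
    fraction = pipeline_bubble_fraction(num_stages=4, num_microbatches=microbatches)
    print(f"4 stages, {microbatches:>2} micro-batches: {fraction:.1%} idle")
```

With a single request in flight, three of four stages sit idle at any moment, which is why pipeline parallelism hurts latency unless there is enough concurrent work to keep the stages full.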
Give me a formula for the GPU memory required to serve a model in inference mode. Walk through each term, then apply it to a 70B model in FP16 with a 4K context window at concurrency 32.