LLM Operations for MLOps Engineers

Embeddings: Turning Text into Math

Your search team wants to switch from BM25 to "semantic search using embeddings." They mention vector databases, dimensions, and similarity scores. Before you provision a new database tier, you need to understand what an embedding actually is, where it lives, and what it costs to store at your document scale.

There are two completely different things that get called "embeddings" in production, and conflating them is the most common source of architectural mistakes I see.

The first kind lives inside the LLM and you do not touch it directly. The second kind is a separate model, a separate API call, and a separate piece of infrastructure. This lesson is mostly about the second one because it is the one you will operate. But you need to understand both because every RAG system uses one to feed the other.

What it is

An embedding is a fixed-size vector of floating-point numbers that represents the "meaning" of a piece of text in a way the model can compute with. Geometrically, similar texts produce nearby vectors; unrelated texts produce vectors far apart. That is the entire premise.

Concretely:

An embedding is a list of N floats, where N is the embedding dimension. Common values: 384, 768, 1024, 1536, 3072.
"Similarity" between two embeddings is usually computed as cosine similarity (the dot product of the two vectors after each is scaled to unit length), which ranges from -1 (pointing in opposite directions) to 1 (pointing in the same direction).
The mapping from text to embedding is a learned function, produced by a model that was trained specifically for the task of producing useful similarity scores.
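To make the similarity definition above concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made-up toy values, not output from any real embedding model, and the helper names are illustrative.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine similarity: dot product of the unit-length versions of a and b."""
    return dot(normalize(a), normalize(b))

# Toy 3-dimensional "embeddings"; real ones have hundreds or thousands of dimensions.
query = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]   # same direction, twice the length
orthogonal = [0.0, 0.0, 5.0]       # shares no direction with the query
opposite = [-1.0, -2.0, 0.0]       # points the other way

sim_same = cosine_similarity(query, same_direction)   # close to 1
sim_orth = cosine_similarity(query, orthogonal)       # 0
sim_opp = cosine_similarity(query, opposite)          # close to -1
raw_dot = dot(query, same_direction)                  # 10.0: magnitude leaks in

print(sim_same, sim_orth, sim_opp, raw_dot)
```

Note that the raw dot product is 10.0 while the cosine similarity is about 1. The two only agree once both vectors are unit-length.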

The two flavors:

Internal embeddings (the model's own embedding layer). Every LLM has a lookup table that maps each token ID to a vector of size hidden_dim (e.g. 8192 for Llama-3 70B). These are an implementation detail of the LLM. You do not call an API to get them. They live inside the GPU during inference and are useless outside it.
External embedding models. Separate models (OpenAI's text-embedding-3-large, Cohere's embed models, BGE, E5, Nomic, etc.) trained specifically to produce one vector per piece of text such that semantic similarity translates to vector similarity. These are what power RAG, semantic search, and clustering.

When someone says "we use embeddings for search," they almost always mean the second kind.

KEY CONCEPT

The embedding model and the LLM are two separate models. They do not share weights. Picking a strong embedding model is a separate engineering decision from picking a strong LLM, and changing either one independently is normal. A common production pattern is "GPT-4 for generation, BGE-large for embeddings, Cohere reranker on top."

How it works under the hood

External embedding models are usually transformer encoders (BERT-style: bidirectional attention, no autoregressive decoding) that produce one vector per input. The vector is typically taken from a special [CLS] token's final hidden state or from mean-pooling all the token vectors.
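The two pooling strategies mentioned above can be sketched in a few lines. This is an illustration on hand-written numbers, not any particular library's implementation: hidden_states stands in for the encoder's final hidden states, and the attention mask (an assumption borrowed from how padded batches are usually handled) marks which positions are real tokens.

```python
def cls_pool(token_vectors):
    """Use the first position's final hidden state ([CLS] in BERT-style models)."""
    return list(token_vectors[0])

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors, skipping padding positions (mask value 0)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask excludes every token")
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Toy final hidden states: 4 positions x 2 dimensions; the last position is padding.
hidden_states = [
    [1.0, 0.0],   # [CLS]
    [3.0, 2.0],
    [2.0, 4.0],
    [9.0, 9.0],   # padding, must not influence the result
]
mask = [1, 1, 1, 0]

cls_vector = cls_pool(hidden_states)          # [1.0, 0.0]
mean_vector = mean_pool(hidden_states, mask)  # [2.0, 2.0]
print(cls_vector, mean_vector)
```

Either way, the output has the same length regardless of how many tokens went in, which is what makes the vector "fixed-size."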

From document to retrievable vector


The "approximate" in ANN (approximate nearest neighbor) search matters. Exact nearest neighbor over millions of vectors is too slow for online serving. Vector databases use approximate algorithms (HNSW, IVF, ScaNN, DiskANN) that trade a small amount of recall for orders of magnitude speedup. Recall@10 of 0.95 means that, averaged over queries, 95% of the true top-10 neighbors appear in the returned top-10. This is usually fine. It is also a tunable parameter, and tuning it wrong is a common source of "search just got worse" tickets.
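One way to measure recall is to compare the index's answer against brute-force search on a sample of queries. The sketch below assumes you can obtain both lists. The approx list is a hard-coded stand-in for whatever your vector database returns, and the vectors are toy values.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def exact_top_k(query, vectors, k):
    """Brute-force search: score every vector, keep the ids of the k most similar."""
    ranked = sorted(vectors, key=lambda vid: cosine(query, vectors[vid]), reverse=True)
    return ranked[:k]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate result also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

vectors = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.7, 0.7],
    "d": [0.0, 1.0],
    "e": [-1.0, 0.0],
}
query = [1.0, 0.05]

exact = exact_top_k(query, vectors, k=3)   # ground truth from scoring everything
approx = ["a", "c", "d"]                   # stand-in for an ANN index's answer
recall = recall_at_k(approx, exact)        # the ANN result missed "b"

print(exact, round(recall, 3))
```

In practice you would average this over a few hundred sampled queries and track it whenever index parameters change.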

The dimension is the most important spec because everything downstream is sized by it:

Dimension	Vector size (FP32)	Storage per 1M vectors	Notes
384	1.5 KB	1.5 GB	Compact, often a small open model
768	3 KB	3 GB	Sentence-transformers default
1024	4 KB	4 GB	BGE-large, E5-large class
1536	6 KB	6 GB	OpenAI ada-002, text-embedding-3-small
3072	12 KB	12 GB	OpenAI text-embedding-3-large

For 100M vectors (one per document, before any chunking), those storage numbers become 150 GB to 1.2 TB just for the vectors, before any index overhead (HNSW typically adds 1.5-3x on top).
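The arithmetic behind the table is simple enough to keep in a script. This sketch assumes FP32 storage (4 bytes per float) and reports decimal gigabytes, so its output is slightly higher than the rounded figures in the table. It covers raw vectors only and does not model index overhead.

```python
BYTES_PER_FP32 = 4

def vector_bytes(dimension):
    """Size of one FP32 vector in bytes."""
    return dimension * BYTES_PER_FP32

def raw_storage_gb(num_vectors, dimension):
    """Raw vector storage in decimal gigabytes (10**9 bytes), excluding any index."""
    return num_vectors * vector_bytes(dimension) / 1e9

for dim in (384, 768, 1024, 1536, 3072):
    print(
        f"{dim:>5} dims: {vector_bytes(dim):>6} B/vector, "
        f"{raw_storage_gb(1_000_000, dim):6.2f} GB per 1M, "
        f"{raw_storage_gb(100_000_000, dim):8.1f} GB per 100M"
    )
```

Pass the vector count, not the document count: if each document is split into several chunks, num_vectors grows by that factor.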

Operationalizing it

Three operational patterns dominate:

1. Pick the dimension based on quality vs. cost, not vendor defaults. Higher dimension does not always mean better retrieval; it does always mean more storage and slower search. For most domains, 768 or 1024 dimensions capture roughly 95% of the quality of a 3072-dim model at a fraction of the cost. Treat that figure as a starting assumption and run a retrieval evaluation on your own data before committing to a dimension.

2. Version your embedding model the same way you version your LLM. Embeddings produced by bge-large-v1.5 are not comparable to embeddings from bge-large-v2. If you upgrade, you have to re-embed your entire corpus. Plan for this. A 100M-document re-embedding pass is a multi-day job at production scale.

3. Store the embedding model's identity alongside every vector.

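-- Assumes Postgres with the pgvector extension, which provides the VECTOR type.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
    id UUID PRIMARY KEY,
    document_id UUID,                       -- source document this chunk came from
    chunk_text TEXT,
    embedding VECTOR(1024),                 -- must match the model's output dimension
    embedding_model TEXT NOT NULL,          -- e.g. "bge-large-en-v1.5"
    embedding_model_version TEXT NOT NULL,
    embedded_at TIMESTAMPTZ NOT NULL        -- when this vector was produced
);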

When you migrate models, this lets you query "which vectors are stale?" and re-embed incrementally.
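The "which vectors are stale?" query can be sketched as follows. To keep the example runnable without a database server, it uses Python's built-in sqlite3 as a stand-in for Postgres and omits the embedding column. The model name "new-model", the row ids, and the helper stale_chunk_ids are all hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for Postgres; only the identity columns are modeled.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE chunks (
        id TEXT PRIMARY KEY,
        embedding_model TEXT NOT NULL,
        embedding_model_version TEXT NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO chunks VALUES (?, ?, ?)",
    [
        ("chunk-1", "bge-large-en", "v1.5"),
        ("chunk-2", "bge-large-en", "v1.5"),
        ("chunk-3", "new-model", "v1"),   # already re-embedded (hypothetical model)
    ],
)

TARGET_MODEL, TARGET_VERSION = "new-model", "v1"

def stale_chunk_ids(conn, model, version, batch_size):
    """Return up to batch_size chunk ids not yet embedded with the target model."""
    rows = conn.execute(
        """
        SELECT id FROM chunks
        WHERE embedding_model <> ? OR embedding_model_version <> ?
        ORDER BY id
        LIMIT ?
        """,
        (model, version, batch_size),
    ).fetchall()
    return [row[0] for row in rows]

stale = stale_chunk_ids(conn, TARGET_MODEL, TARGET_VERSION, batch_size=100)
print(stale)
```

A re-embedding worker could loop over batches like this until the query returns nothing. While the migration is in flight, queries still have to be served from vectors of a single model, so the same two columns are useful as a query-time filter.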

WAR STORY

A team had a beautiful RAG pipeline using OpenAI's text-embedding-ada-002. OpenAI announced its deprecation. The team flipped to text-embedding-3-small thinking "same dimensionality, same vendor, drop-in replacement." Search relevance collapsed. The new model produces vectors in a different space; you cannot mix and match. They had to re-embed their entire 80M-document index, which took 4 days and cost about $14K in API calls. Lesson: embedding model migrations are full re-indexing events. Schedule them like database migrations.

Trade-offs and decision framework

There are five real decisions when you adopt embeddings:

Build vs. buy the embedding model. Hosted (OpenAI, Cohere, Voyage) is faster to start and needs no ops, but you pay per call. Self-hosted (BGE, E5, Nomic on a GPU) is cheaper at scale and removes data egress concerns, but you operate it.
Dimension. Smaller is cheaper and faster to search; larger is usually slightly more accurate. Most production teams land at 768 or 1024.
Vector database. Postgres with pgvector is the easiest path if you already run Postgres and have under ~10M vectors. Above that, dedicated vector DBs (Qdrant, Weaviate, Pinecone, Milvus) start to win on query latency and operational ergonomics.
Index algorithm. HNSW is the default and usually right. IVF is faster to build but worse for updates. DiskANN scales to billions of vectors but adds complexity. Pick the default unless you have a specific reason.
Reranking. A second model (often a smaller one) re-scores the top-K from your vector search. Adds latency, often worth it for quality. Cohere's reranker is the easy option. Open alternatives exist (BGE reranker, ColBERT) for self-hosters.
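Several of the decisions above (dimension, index algorithm, whether to rerank) are easiest to settle by scoring each candidate configuration on a small set of labeled queries. The sketch below computes two common retrieval metrics, hit rate at k and mean reciprocal rank. The query ids, document ids, and the two result sets are invented for illustration; in practice they would come from running your real queries against each configuration.

```python
def hit_rate_at_k(results, relevant, k):
    """Share of queries with at least one relevant id in the top k results."""
    hits = sum(1 for q, ranked in results.items() if set(ranked[:k]) & relevant[q])
    return hits / len(results)

def mean_reciprocal_rank(results, relevant):
    """Average of 1/rank of the first relevant result (0 when none is returned)."""
    total = 0.0
    for q, ranked in results.items():
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant[q]:
                total += 1.0 / rank
                break
    return total / len(results)

# Hand-labeled ground truth: which documents answer each query.
relevant = {"q1": {"d1"}, "q2": {"d7"}, "q3": {"d4", "d5"}}

# Ranked results from two hypothetical configurations (e.g. two dimensions).
results_small = {"q1": ["d1", "d2", "d3"], "q2": ["d3", "d7", "d9"], "q3": ["d8", "d9", "d2"]}
results_large = {"q1": ["d1", "d2", "d3"], "q2": ["d7", "d3", "d9"], "q3": ["d8", "d4", "d2"]}

for name, results in (("small", results_small), ("large", results_large)):
    print(name, hit_rate_at_k(results, relevant, k=3), mean_reciprocal_rank(results, relevant))
```

If the cheaper configuration scores within your tolerance on your own data, the storage and latency savings usually decide the question.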

Common mistakes

Mixing embedding models in the same index. Vectors from different models live in different geometric spaces. They are not comparable, even if the dimensions match.
Sizing the vector DB by document count instead of vector count. If you chunk each document into 10 pieces, you have 10x the vectors, not 1x.
Embedding the entire document instead of chunking. The model has a max input length (often 512 or 8K tokens). Above that, many models and client libraries silently truncate (others reject the request), and a truncated embedding represents only the first part of the document.
Computing cosine similarity without normalizing first. A plain dot product equals cosine similarity only when both vectors are unit-length. On un-normalized vectors the dot product also reflects vector magnitude, so the ranking it produces can differ from the cosine ranking. Either normalize when you write the vectors or use a distance function that normalizes for you.
Treating high cosine similarity as "this is the right answer." Two paragraphs about cats can have cosine similarity 0.85 even if they say opposite things. Embeddings capture topical similarity, not factual agreement. RAG layers a reranker and an LLM on top to handle that gap.
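The chunking and sizing mistakes above are related: chunking is what keeps inputs under the model's limit, and it is also what multiplies your vector count. Here is a minimal sketch of fixed-size chunking with overlap. It splits on whitespace as a stand-in for the embedding model's real tokenizer, and the window sizes are deliberately tiny so the result is easy to read.

```python
def chunk_tokens(tokens, max_tokens, overlap):
    """Split a token list into windows of at most max_tokens sharing `overlap` tokens."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break   # this window already reached the end of the document
    return chunks

# Whitespace "tokens" for illustration; use the model's tokenizer for real limits.
document = "one two three four five six seven eight nine ten".split()
chunks = chunk_tokens(document, max_tokens=4, overlap=1)

for chunk in chunks:
    print(chunk)

# One document became len(chunks) vectors; size the vector DB by this number.
print("vectors for this document:", len(chunks))
```

Every chunk stays within the limit, so nothing is truncated, and the overlap keeps sentences that straddle a boundary represented in at least one chunk.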

INTERVIEW QUESTION

How do you decide on the right embedding dimension for a production retrieval system? Walk through the cost vs. quality tradeoffs at 100 million documents.
