One Label Added Four Million Series to Your Prometheus. Here Is the Math.
A developer adds user_id to one counter to debug a support ticket. A week later Prometheus is eating 60 GB of RAM and queries time out. This is cardinality, the hidden cost center of metrics, and the math that predicts the disaster before it happens.
A developer is chasing a support ticket. One customer says their requests are slow, and nobody can reproduce it. So the developer does the obvious thing: adds a user_id label to the HTTP request counter, ships it, and now they can query exactly what that one customer is doing. It works. The ticket gets closed.
A week later your phone buzzes. Prometheus is using 60 GB of RAM, the scrape loop is falling behind, and half your Grafana dashboards time out on refresh. Nothing deployed in the last hour. No traffic spike. The cluster looks healthy. Prometheus itself is the thing that is dying.
You run a query to see which metric is the worst offender:
topk(5, count by (__name__)({__name__=~".+"}))
And one line stands out:
http_requests_total 4218934
One metric. 4.2 million time series. A week ago it had a few thousand. The user_id label did this, and the surprising part is not that it happened, it is that the math guaranteed it the moment the label shipped.
Cardinality is multiplication, not addition
The cardinality of a metric is the number of unique label-value combinations it produces. Every unique combination is a separate time series, stored separately, indexed separately, queried separately.
The formula is the whole story:
cardinality = product of (distinct values per label)
Say http_requests_total carries three labels:
- method: 4 distinct values (GET, POST, PUT, DELETE)
- route: 10 distinct values
- status: 20 distinct values
That is 4 * 10 * 20 = 800 time series. Completely fine. Prometheus handles that without noticing.
Now add user_id with 2 million active users:
4 * 10 * 20 * 2,000,000 = 1.6 billion potential series
This is the trap that catches people: a new label does not add to your series count, it multiplies it. The developer was thinking "I added one label." The TSDB experienced "every existing series just got cloned once per user." Anything that scales with users, requests, sessions, or time is not a label, it is a multiplier on everything else.
A label's danger is not its own size, it is the product it creates with every other label on the metric. A bounded label (region, status class, HTTP method) is safe forever. An unbounded label (user_id, request_id, raw URL, timestamp) multiplies your entire series count by its growth, and it never stops growing.
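The multiplication above can be checked in a few lines. This is a minimal sketch; the `cardinality` helper is a hypothetical name, and the label counts are the ones used in this section.

```python
from math import prod

def cardinality(distinct_values: dict[str, int]) -> int:
    """Worst-case series count: the product of distinct values per label."""
    return prod(distinct_values.values())

labels = {"method": 4, "route": 10, "status": 20}
before = cardinality(labels)                           # 4 * 10 * 20
after = cardinality({**labels, "user_id": 2_000_000})  # same labels plus user_id

print(before)           # 800
print(after)            # 1600000000
print(after // before)  # 2000000: the multiplier is the new label's size
```

The ratio is the point: the new label's distinct-value count is applied to every series that already existed.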
Why each series is expensive
A single time series sounds cheap. The problem is that the cost is paid three times over, and all three scale linearly with series count.
A useful rule of thumb for Prometheus:
- Memory: roughly 1 to 3 KB of RAM per active series (label strings, the inverted index entry, the in-memory sample chunk). Call it ~3 KB to be safe.
- Storage: at a typical 15s scrape interval, samples compress to roughly 1 to 2 bytes each, so a series retained for an hour is a few hundred bytes on disk, and it accumulates for the whole retention window (15 days by default).
- Query CPU: every query that matches a metric name has to walk the index for all series under it, so a rate() over a metric with millions of series scans millions of series.
Run the memory number forward:
1 million series x ~3 KB = ~3 GB RAM just for the active series index
10 million series x ~3 KB = ~30 GB RAM
So 4.2 million series is somewhere around 12 to 15 GB of resident memory for that one metric, before counting the query-time working set. That is how an 8 GB Prometheus becomes a 60 GB Prometheus as new series keep arriving. The cost curve is linear, but the symptom is a cliff: the dashboard that returned in 40 ms now scans for several seconds and trips Grafana's timeout, which is what actually pages you.
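The rule-of-thumb arithmetic can be run forward for any series count. This is a rough sketch: the function names are hypothetical, the ~3 KB per series and 1.5 bytes per sample are the estimates quoted above rather than measured values, and decimal units are assumed to match the round numbers in the text.

```python
KB = 1000  # decimal units, matching the round numbers in the text

def estimated_ram_gb(active_series: int, bytes_per_series: int = 3 * KB) -> float:
    """RAM estimate in GB at a flat per-series cost."""
    return active_series * bytes_per_series / 1e9

def disk_bytes_per_series(hours: float, scrape_interval_s: int = 15,
                          bytes_per_sample: float = 1.5) -> float:
    """On-disk bytes for one continuously scraped series over a time window."""
    samples = hours * 3600 / scrape_interval_s
    return samples * bytes_per_sample

for count in (1_000_000, 4_218_934, 10_000_000):
    print(f"{count:>10,} series -> ~{estimated_ram_gb(count):.1f} GB RAM")

print(disk_bytes_per_series(hours=1))        # 360.0 bytes: "a few hundred"
print(disk_bytes_per_series(hours=15 * 24))  # 129600.0 bytes over default retention
```

Real per-series cost depends on label length, churn, and Prometheus version, so treat the output as an order-of-magnitude check.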
Long-term stores (VictoriaMetrics, Mimir, Thanos) compress better and shard the index, so they survive higher cardinality. But the economics do not change. High cardinality is more expensive everywhere, it just fails louder on a single Prometheus.
Finding the label that did it
You already found the metric with the topk query above. Now confirm which label is the multiplier. For a suspected label:
# How many distinct values does user_id actually have on this metric?
count(count by (user_id)(http_requests_total))
If that returns 1847293, there is your answer. The label has 1.8 million distinct values, and each one forked every other series on the metric.
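To make the query's logic concrete, here is a small sketch of the same count done over label sets in memory. The `distinct_values_per_label` helper and the synthetic series are illustrative assumptions, not output from a real server.

```python
from collections import defaultdict

def distinct_values_per_label(series: list[dict[str, str]]) -> dict[str, int]:
    """For each label name, count how many distinct values appear across all series."""
    seen = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            seen[name].add(value)
    return {name: len(values) for name, values in seen.items()}

# Synthetic label sets: 2 methods x 2 statuses x 1000 users = 4000 series.
series = [
    {"method": m, "status": s, "user_id": f"u{i}"}
    for i in range(1000)
    for m in ("GET", "POST")
    for s in ("200", "500")
]

counts = distinct_values_per_label(series)
worst = max(counts, key=counts.get)
print(counts)  # {'method': 2, 'status': 2, 'user_id': 1000}
print(worst)   # user_id
```

The label with the largest count is the multiplier. In PromQL you get the same number one label at a time with the `count(count by (...))` form shown above.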
If you do not yet know which metric is guilty, rank metrics by series count so you know where to look first:
topk(10, count by (__name__)({__name__=~".+"}))
On Mimir or Cortex, the ingester in-memory series gauge tells you the same story at the fleet level:
cortex_ingester_memory_series
Stopping the bleeding
There are two clocks running: the instrumentation (still emitting the bad label on every scrape) and the TSDB (holding the series it already ingested until retention expires). You have to stop both.
1. Remove the label in code and redeploy. This is the real fix. Everything else is cleanup. Until the instrumentation stops emitting user_id, you are still ingesting new series every scrape.
2. Drop the label at scrape time so you are protected during the rollout, and so a teammate cannot reintroduce it without you noticing:
# In the scrape_config for this target
metric_relabel_configs:
  - action: labeldrop
    regex: user_id
3. Keep it out of long-term storage if you remote-write to Mimir/Thanos/VictoriaMetrics:
remote_write:
  - url: http://mimir/api/v1/push
    write_relabel_configs:
      - action: labeldrop
        regex: user_id
4. Let retention reclaim the old series. Even after you stop emitting the label, the 4.2 million dead series sit in the TSDB until they age out (default 15 days). They stop receiving samples, so memory pressure eases once they are compacted out of the in-memory head block, but disk does not free instantly. Deleting TSDB blocks by hand and restarting is the nuclear option; only reach for it if the server is actively OOM-crash-looping and cannot survive to retention.
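The effect of dropping the label can be modeled on label sets alone. This sketch uses a hypothetical `drop_label` helper and synthetic series; it shows only how many distinct label sets remain, not what Prometheus does with the sample values.

```python
def drop_label(series: list[dict[str, str]], name: str) -> set[frozenset]:
    """Return the distinct label sets that remain after removing one label."""
    return {
        frozenset((k, v) for k, v in labels.items() if k != name)
        for labels in series
    }

# Synthetic label sets: 5000 users x 2 methods on one route.
series = [
    {"method": m, "route": "/api/users/:id", "user_id": f"u{i}"}
    for i in range(5000)
    for m in ("GET", "POST")
]

remaining = drop_label(series, "user_id")
print(len(series))     # 10000
print(len(remaining))  # 2
```

Note what the collapse implies: many scraped samples now share one label set, and relabeling does not sum them. Until the code fix lands, the values on the surviving series may not be trustworthy. That is one more reason step 1 is the real fix and the relabel rule is a guardrail.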
A team turned on Istio's default mesh metrics and their Prometheus RAM doubled overnight with no application change at all. Envoy's default metric set ships with a dozen labels (source service, source version, destination service, destination version, destination subset, response code, connection security, and more), and in a cluster with dozens of services those labels multiply into an enormous series count. The fix was not turning off the mesh, it was a metric_relabel_configs block that kept only the five or six labels they actually queried. RAM dropped back to baseline. The lesson: it is not only your own code that adds labels. Any dependency that auto-instruments (a service mesh, an OpenTelemetry collector, a client library) can blow up cardinality on your behalf, so review what it emits before you ship it cluster-wide.
The labels that are always traps
You do not need to memorize a list. You need one question: is the set of possible values bounded and small, or does it grow with users, requests, or time?
| Label | Distinct values | Verdict |
|---|---|---|
| user_id, email, session_id | grows with users | Never |
| request_id, trace_id | one per request | Never |
| raw URL with IDs (/api/users/12345) | grows with entities | Never |
| any timestamp in a label | unbounded | Never |
| free-text error message | unbounded | Never |
| route pattern (/api/users/:id) | one per route | Safe |
| method, status_class, region, env | small fixed set | Safe |
| version SHA | a few per day | Usually safe |
The two most common rewrites that fix real incidents:
# BAD: every entity is a new series
http_requests_total{path="/api/users/12345"}
# GOOD: one series per route, regardless of which user
http_requests_total{route="/api/users/:id"}
# BAD: every distinct message is a new series
request_errors_total{error="invalid email: foo@bar.invalid"}
# GOOD: a bounded enum; the detail goes to logs
request_errors_total{error_class="invalid_email"}
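Both rewrites come down to mapping an unbounded string onto a bounded one before it becomes a label value. The sketch below is one way to do that in instrumentation code. The regex, the `route_label` and `error_class` helpers, and the error table are all illustrative assumptions; where your web framework exposes the matched route template, prefer that over pattern-matching the raw path.

```python
import re

# Assumed rule: a path segment that is all digits or a UUID is an entity ID.
_ID_SEGMENT = re.compile(
    r"/(?:\d+|[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})(?=/|$)"
)

def route_label(path: str) -> str:
    """Collapse entity IDs in a raw path into a fixed :id placeholder."""
    return _ID_SEGMENT.sub("/:id", path)

# Hypothetical mapping from message fragments to a small, fixed set of classes.
_ERROR_CLASSES = {"invalid email": "invalid_email", "timeout": "timeout"}

def error_class(message: str) -> str:
    """Map a free-text error message onto a bounded enum; unknowns become 'other'."""
    lowered = message.lower()
    for needle, label in _ERROR_CLASSES.items():
        if needle in lowered:
            return label
    return "other"

print(route_label("/api/users/12345"))                # /api/users/:id
print(error_class("invalid email: foo@bar.invalid"))  # invalid_email
```

The catch-all "other" class matters: without it, an unexpected message would fall through as a new label value and the set would stop being bounded.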
And for anything that is a distribution (latency, payload size, duration), the answer is never a label per value. It is a histogram with a fixed bucket count:
http_request_duration_seconds_bucket{le="0.1"}
http_request_duration_seconds_bucket{le="0.5"}
http_request_duration_seconds_bucket{le="1"}
http_request_duration_seconds_bucket{le="5"}
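To see why a histogram stays bounded, here is a small sketch of cumulative bucket counting using the four `le` bounds above plus the implicit `+Inf` bucket. The `bucket_counts` helper and the sample observations are illustrative; a real client library does this bookkeeping for you.

```python
from bisect import bisect_left

BOUNDS = [0.1, 0.5, 1.0, 5.0]              # upper bounds from the example above
LABELS = ["0.1", "0.5", "1", "5", "+Inf"]  # le label values, including +Inf

def bucket_counts(observations: list[float]) -> dict[str, int]:
    """Cumulative count per le bound: each bucket counts observations <= its bound."""
    per_bucket = [0] * len(LABELS)
    for value in observations:
        # bisect_left finds the first bound >= value; past the end means +Inf.
        per_bucket[bisect_left(BOUNDS, value)] += 1
    cumulative, running = {}, 0
    for label, n in zip(LABELS, per_bucket):
        running += n
        cumulative[label] = running
    return cumulative

print(bucket_counts([0.03, 0.1, 0.4, 0.9, 2.5, 7.0]))
# {'0.1': 2, '0.5': 3, '1': 4, '5': 5, '+Inf': 6}
```

However many requests are observed, the result has five keys. The series count is fixed by the bucket list, not by the data.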
Bound it before it happens
The reason this incident is so common is that nothing stops it. The bad label ships, the series count climbs quietly for days, and the first signal anyone gets is RAM pressure long after the change. The fix is to make cardinality a number you watch, not a number you discover.
Give each high-volume metric a documented budget:
http_requests_total
expected labels: method (4), route (~30), status_class (5)
expected series: 4 * 30 * 5 = 600
ceiling: 5000
Then alert when reality exceeds the ceiling:
count(http_requests_total) > 5000
A cardinality-ceiling alert is the single cheapest piece of observability insurance you can buy. When someone ships http_requests_total{user_id=...}, the alert fires within a scrape interval or two, the change gets reverted the same afternoon, and the 3am page never happens. You are trading one alert rule per high-volume metric for never debugging a Prometheus OOM under pressure again.
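A budget like the one above is easy to keep next to the code as data. This is a minimal sketch; the `Budget` class and its fields are hypothetical, and the observed count would come from the series-count query for that metric.

```python
from dataclasses import dataclass
from math import prod

@dataclass
class Budget:
    metric: str
    expected_labels: dict[str, int]  # label name -> expected distinct values
    ceiling: int                     # alert threshold, with headroom

    @property
    def expected_series(self) -> int:
        """Expected series count: the product of the per-label estimates."""
        return prod(self.expected_labels.values())

    def exceeded_by(self, observed_series: int) -> bool:
        """True when the live series count is above the documented ceiling."""
        return observed_series > self.ceiling

budget = Budget(
    metric="http_requests_total",
    expected_labels={"method": 4, "route": 30, "status_class": 5},
    ceiling=5000,
)

print(budget.expected_series)         # 600
print(budget.exceeded_by(600))        # False
print(budget.exceeded_by(4_218_934))  # True
```

Setting the ceiling well above the expected count, here roughly 8x, leaves room for new routes without paging anyone, while an unbounded label still crosses it almost immediately.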
When you genuinely need per-request detail
Sometimes the question really is "what is this user experiencing right now" or "trace this request end to end." That is a legitimate need. The mistake is trying to answer it with metric labels.
Use the tool whose cost structure fits the question:
| Question | Wrong tool | Right tool |
|---|---|---|
| Aggregate behavior over time | logs | metrics |
| One specific request, end to end | metrics with per-request labels | traces |
| What did one user do | metrics with user_id | logs filtered by user_id |
| Latency distribution | high-cardinality labels | histogram buckets |
Metrics are built for bounded, aggregate questions and they are extremely cheap at that job. Per-entity questions belong in logs (Loki, Elasticsearch), traces (Tempo, Jaeger), or wide-event stores (Honeycomb, Datadog APM) that are designed to absorb high cardinality. The original support ticket that started all this? That was a per-user question. It wanted a log query or a trace lookup, not a new metric label.
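As a sketch of that split, the handler below counts the request under bounded labels and puts the per-user detail in a structured log line instead. The `Counter` here is a plain-Python stand-in for a metrics client, and `record_request` and the field names are illustrative assumptions.

```python
import json
from collections import Counter

# Stand-in for a metrics client: one counter value per bounded label tuple.
requests_total: Counter = Counter()

def record_request(method: str, route: str, status_class: str,
                   user_id: str, duration_s: float) -> str:
    """Count the request under bounded labels; put per-user detail in a log line."""
    requests_total[(method, route, status_class)] += 1
    return json.dumps({
        "event": "http_request", "method": method, "route": route,
        "status_class": status_class, "user_id": user_id, "duration_s": duration_s,
    })

lines = [
    record_request("GET", "/api/users/:id", "2xx", f"u{i}", 0.12)
    for i in range(1000)
]

print(len(requests_total))  # 1 series, no matter how many users
print(len(lines))           # 1000 log lines, each filterable by user_id
```

In a real service the returned string would go to your logger and on to a log store, where filtering on user_id answers the support-ticket question without touching the series count.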
The mental model that prevents it
Stop reading a label as "a field I can filter on." Read it as "a multiplier on my total series count, equal to its number of distinct values." Once that is the instinct, the whole class of incident disappears: you see user_id proposed as a label and you hear "multiply everything by two million," and you reach for a log field instead.
Bounded labels are free. Unbounded labels are a slow-motion outage with the timer already running. The math tells you which is which before you ship, if you do the multiplication.