Observability Fundamentals for Engineers

Cardinality and Why It Matters

A team adds user_id as a label to their HTTP request counter. It works beautifully — they can query "how many requests did user 42 make last hour?" directly from Prometheus. A week later, the Prometheus server is consuming 80 GB of RAM, queries are timing out, and the on-call engineer is paged at 3 AM because scraping is failing. They had 2 million active users. Each user generated a new unique time series. The metric http_requests_total{user_id="..."} expanded into 2 million separate series — each stored, each indexed, each queried. The fix: remove user_id. Prometheus recovers, 2 million series get garbage-collected, and the team adopts a rule: user_id never goes in metric labels.

Cardinality is the hidden cost center of observability. It is also the number-one way teams accidentally destroy their Prometheus/Mimir/VictoriaMetrics infrastructure. This lesson explains what cardinality is, why high-cardinality metrics cost exponentially more, and the specific label-design practices that keep costs sane.


What Cardinality Is

The cardinality of a metric is the number of unique combinations of label values it can produce. Each unique combination is a separate time series.

Example:

http_requests_total{method="GET", endpoint="/api/users", status="200"}
http_requests_total{method="GET", endpoint="/api/users", status="500"}
http_requests_total{method="POST", endpoint="/api/users", status="201"}
http_requests_total{method="GET", endpoint="/api/orders", status="200"}

Four distinct time series, because the (method, endpoint, status) combinations differ.

Formula:

cardinality = product of (distinct values per label)

If you have:

  • 10 endpoints
  • 4 methods
  • 20 status codes

Cardinality = 10 x 4 x 20 = 800 time series.

Add user_id with 2 million unique values:

Cardinality = 10 x 4 x 20 x 2,000,000 = 1.6 billion time series.
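
The multiplication above can be checked directly; a quick sketch in Python (the label counts are the worked example's numbers):

```python
from math import prod

# Distinct values per label, from the worked example above
label_values = {"endpoint": 10, "method": 4, "status": 20}

base = prod(label_values.values())   # 10 * 4 * 20
print(base)                          # 800 series

# One high-cardinality label multiplies everything that came before it
with_user_id = base * 2_000_000
print(with_user_id)                  # 1,600,000,000 series
```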

KEY CONCEPT

Every new label with high cardinality multiplies your total series count. Prometheus was designed assuming cardinality in the tens of thousands per metric, not billions. High-cardinality labels do not just cost more — they cost exponentially more as they multiply with your existing labels.


Why Cardinality Is Expensive

Prometheus (and every metric backend: VictoriaMetrics, Mimir, Cortex, Thanos) indexes by label set. Each unique label combination needs:

  • A row in the series index.
  • Memory for the series metadata.
  • Storage for the time-series samples.
  • CPU for queries that touch it.

A back-of-envelope calculation:

  • Each series: ~3 KB of memory (label strings, index entries, sample cache).
  • Each series: ~1-2 bytes per sample stored (highly compressed), at default 15s scrape = ~240 samples/hour = ~500 bytes/hour retained.

At 1 million series:

  • Memory: ~3 GB for active series index alone.
  • Storage: ~500 MB per hour retained = 12 GB/day = 360 GB/month.

At 10 million series:

  • Memory: ~30 GB.
  • Storage: ~3.6 TB per month.

And query performance degrades non-linearly. A query that scans across millions of series takes seconds instead of milliseconds; a dashboard refresh that fans out several such queries simply times out.
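
The arithmetic above, as a reusable sketch (the per-series constants are the text's back-of-envelope estimates, not measured values):

```python
# Rough Prometheus cost model; constants are back-of-envelope estimates
BYTES_PER_SERIES_MEMORY = 3_000    # ~3 KB resident per active series
BYTES_PER_SERIES_PER_HOUR = 500    # ~500 B/hour retained at 15s scrape

def estimated_cost_gb(series: int, hours_retained: int):
    """Return (memory_gb, storage_gb) for a given active-series count."""
    memory_gb = series * BYTES_PER_SERIES_MEMORY / 1e9
    storage_gb = series * BYTES_PER_SERIES_PER_HOUR * hours_retained / 1e9
    return memory_gb, storage_gb

mem, store = estimated_cost_gb(1_000_000, 24 * 30)   # 1M series, 30 days
print(f"memory ~{mem:.0f} GB, storage ~{store:.0f} GB/month")
```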

Modern long-term stores (VictoriaMetrics, Mimir) handle higher cardinality with better compression, but the economics still favor keeping cardinality bounded.


The High-Cardinality Hall of Shame

Labels you almost never want:

Label                             | Typical cardinality                      | Impact
user_id                           | 10,000 to billions                       | Catastrophic
trace_id / request_id             | Every request (millions/hr)              | Catastrophic
session_id                        | Active sessions (thousands to millions)  | Catastrophic
email                             | Unique users                             | Catastrophic
ip_address                        | IPv4 space is 4B                         | Very high
Raw URL path (with IDs baked in)  | Unbounded                                | High
Timestamp in label                | Unbounded                                | Catastrophic
Error message text                | Unbounded                                | Very high
Customer name                     | Depends (tenants?)                       | Can be OK at low scale
Version SHA                       | Low (deploys per day)                    | OK

Anything that scales with users, requests, or time is dangerous. Anything with a bounded list (regions, environments, services) is fine.


Good vs Bad Label Design

BAD: URL path with IDs embedded

http_requests_total{path="/api/users/12345"}
http_requests_total{path="/api/users/12346"}
http_requests_total{path="/api/users/12347"}

Every user generates a new series. Cardinality explodes.

GOOD: Route pattern

http_requests_total{route="/api/users/:id"}

One series per route regardless of user. Your HTTP framework (Express, Gin, FastAPI, etc.) exposes the route pattern — use it.
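
If a framework does not hand you the route pattern, a normalization fallback can collapse ID-bearing segments before they become label values. A sketch; the regexes below are illustrative assumptions, not a complete rule set:

```python
import re

# Illustrative rules: collapse unbounded path segments into placeholders
NORMALIZERS = [
    # UUIDs first, so the numeric rule does not mangle them
    (re.compile(r"/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
                r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"), "/:uuid"),
    (re.compile(r"/\d+"), "/:id"),
]

def route_label(path: str) -> str:
    """Return a bounded route pattern suitable for a metric label."""
    for pattern, placeholder in NORMALIZERS:
        path = pattern.sub(placeholder, path)
    return path

print(route_label("/api/users/12345"))            # /api/users/:id
print(route_label("/api/users/12345/orders/77"))  # /api/users/:id/orders/:id
```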

BAD: Error message as label

request_errors_total{error="Invalid email format: foo@bar.invalid"}
request_errors_total{error="Invalid email format: test@nowhere"}

Each distinct error message becomes a new series.

GOOD: Error class

request_errors_total{error_class="invalid_email"}

Bounded set of error classes; unbounded details go to logs.
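
Classification can be a small lookup at the instrumentation point; a sketch, with hypothetical substrings as the matching rules:

```python
# Map unbounded error text to a bounded error_class enum (rules are examples)
ERROR_CLASSES = {
    "invalid_email": ("Invalid email format",),
    "timeout": ("deadline exceeded", "timed out"),
}

def classify_error(message: str) -> str:
    """Return a bounded class; the full message belongs in logs, not labels."""
    for error_class, needles in ERROR_CLASSES.items():
        if any(needle in message for needle in needles):
            return error_class
    return "other"  # bounded catch-all keeps cardinality fixed

print(classify_error("Invalid email format: foo@bar.invalid"))  # invalid_email
print(classify_error("connection reset by peer"))               # other
```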

BAD: Response time as a label

slow_requests_total{duration_ms="1247"}

Every distinct duration is a new series. Use a histogram instead:

GOOD: Histogram buckets

http_request_duration_seconds_bucket{le="0.1"}
http_request_duration_seconds_bucket{le="0.5"}
http_request_duration_seconds_bucket{le="1"}
http_request_duration_seconds_bucket{le="5"}

Fixed number of buckets; answers all "how many requests were faster than X?" questions.
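
What a histogram does with each observation can be sketched in a few lines (bucket bounds taken from the example above):

```python
import bisect

BOUNDS = [0.1, 0.5, 1, 5]  # the le="..." bounds above; +Inf is implicit

def cumulative_buckets(observations):
    """Count per bucket, then accumulate, Prometheus-style (le is <=)."""
    per_bucket = [0] * (len(BOUNDS) + 1)  # last slot is the +Inf bucket
    for value in observations:
        # bisect_left finds the first bound >= value, honoring the <= semantics
        per_bucket[bisect.bisect_left(BOUNDS, value)] += 1
    cumulative, total = [], 0
    for count in per_bucket:
        total += count
        cumulative.append(total)
    return cumulative

print(cumulative_buckets([0.05, 0.3, 0.7, 2.0, 9.0]))  # [1, 2, 3, 4, 5]
```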


How to Monitor Cardinality

See your top metrics by cardinality

# Number of series per metric name (Prometheus)
topk(10, count by (__name__) ({__name__=~".+"}))

This prints the 10 metrics contributing the most series. If one metric has millions, you have a cardinality problem.

Find which labels are exploding

# Cardinality of a specific metric by label
count(count by (user_id) (http_requests_total))

If this returns 1,847,293 — your user_id label has that many distinct values. Remove it.
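
The same per-label count can be approximated without PromQL by parsing a raw text-format /metrics scrape; a minimal sketch (the sample data is illustrative):

```python
import re
from collections import defaultdict

# A tiny text-format scrape (illustrative sample data)
SCRAPE = """http_requests_total{method="GET",user_id="1"} 3
http_requests_total{method="GET",user_id="2"} 1
http_requests_total{method="POST",user_id="3"} 7"""

def distinct_label_values(exposition: str, metric: str) -> dict:
    """Count distinct values per label for one metric in a scrape."""
    values = defaultdict(set)
    for labels in re.findall(rf"^{metric}{{(.*?)}}", exposition, re.M):
        for name, value in re.findall(r'(\w+)="([^"]*)"', labels):
            values[name].add(value)
    return {name: len(vals) for name, vals in values.items()}

print(distinct_label_values(SCRAPE, "http_requests_total"))
# {'method': 2, 'user_id': 3}
```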

Per-tenant cardinality (Mimir / Cortex)

# Mimir/Cortex tenant metrics
curl http://mimir:8080/api/v1/user_stats

# Or query from Prometheus
cortex_ingester_memory_series{job="mimir-ingester"}

For managed services (Grafana Cloud, Datadog, etc.), cardinality is a billable dimension — they surface it in usage dashboards.


The Cardinality Budget

Treat cardinality as a finite budget per metric. Document it:

# metric-cardinality-budgets.md

http_requests_total:
  expected labels: method (5), route (~30), status_class (5)
  expected cardinality: 5 * 30 * 5 = 750
  max allowed: 5,000

database_query_duration_seconds:
  expected labels: operation (10), table (50)
  expected cardinality: 500
  max allowed: 2,000

Alert if cardinality exceeds the budget:

count({__name__="http_requests_total"}) > 5000

This catches accidental label explosions before they crash the server.
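
The same ceiling check can also run outside Prometheus, e.g. in CI or a cron job, comparing live series counts against the documented budgets; a sketch using the example's numbers:

```python
# Budgets mirror the metric-cardinality-budgets doc (example numbers)
BUDGETS = {
    "http_requests_total": 5_000,
    "database_query_duration_seconds": 2_000,
}

def over_budget(live_counts: dict) -> list:
    """Return metric names whose live series count exceeds their budget."""
    return [
        name for name, count in live_counts.items()
        if count > BUDGETS.get(name, float("inf"))
    ]

# live_counts would come from: count by (__name__) ({__name__=~".+"})
print(over_budget({"http_requests_total": 812_000}))  # ['http_requests_total']
print(over_budget({"http_requests_total": 750}))      # []
```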

PRO TIP

Set up a cardinality-ceiling alert for each high-volume metric. When someone accidentally ships http_requests_total{user_id=...}, the alert fires within minutes, they roll back, and the disaster is averted. Most teams learn this lesson the hard way — you can learn it cheaply.


When You Actually Need High Cardinality

Sometimes the question is genuinely per-user, per-request, per-session:

  • What is user X experiencing right now?
  • Trace this specific request end-to-end.
  • What is the distribution of latencies by customer?

The answer is NOT to add high-cardinality labels to Prometheus metrics. The answer is to use the right tool for the job:

Question                          | Wrong tool                            | Right tool
Aggregate behavior                | Logs, traces                          | Metrics
Per-request investigation         | Metrics (per-request labels)          | Traces
Per-user behavior                 | Metrics (user_id label)               | Logs (filterable by user_id) / Events / Wide events
Heat-map of latency distribution  | Metrics with high-cardinality labels  | Histogram metrics + tracing

Modern observability split: Prometheus for aggregate patterns, Loki/Elasticsearch for log-level detail, Tempo/Jaeger for per-request traces. The tools differ because the cost structures differ. Event-centric tools (Honeycomb, Datadog APM) handle high cardinality natively but cost more per event.


Wide Events — the Modern Alternative

Observability tools like Honeycomb (and increasingly, OpenTelemetry's philosophy) emphasize wide events: structured per-request records with 50-200 attributes. Queries are "show me p99 latency grouped by customer, region, version, feature-flag."

This is powerful — cardinality is no longer a budget constraint — but requires a different backend designed for it. Prometheus cannot do this. ClickHouse-backed tools, Honeycomb, Datadog APM, and similar CAN.

Many teams run:

  • Prometheus for aggregate metrics with bounded cardinality (the golden signals, dashboards, alerts).
  • Wide-event store for investigation (traces + spans + attributes).
  • Loki / Elasticsearch for raw logs.

This gets you cardinality where you need it (per-request/per-user investigation) without blowing up Prometheus.


Migration Paths When Cardinality Explodes

When you discover a high-cardinality metric in production:

1. Stop the bleeding

Remove the offending label from the metric (change the instrumentation, redeploy).

2. Drop the old series

# Prometheus relabel config — drop a specific metric at scrape time
# (goes inside the relevant scrape_configs entry)
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'http_requests_total'
    action: drop
  # or better, drop only the offending label:
  - action: labeldrop
    regex: user_id

Or, in Prometheus config, remote_write with write_relabel_configs to drop before shipping to long-term storage.

3. Clean up old TSDB data

Old series persist until retention expires (default 15 days). Either wait it out, use the TSDB admin API to delete the offending series (POST to /api/v1/admin/tsdb/delete_series with a match[] selector; requires starting Prometheus with --web.enable-admin-api), or, if urgent, delete the tsdb directory and restart (drastic; use only if the series are crashing the server).

4. Post-mortem

Add a cardinality-ceiling alert so this does not recur. Document the label-design rules for your team.


Common Cardinality Traps

Dynamically generated labels from user input

# User input becomes a label value — unbounded
http_request_errors_total{message=$userInput}

# Fix: classify into a bounded error enum first

Changing label values on deploy

# version label changes on every deploy
http_requests_total{version="abc123"}
http_requests_total{version="def456"}

OK if you deploy a few times a day. Dangerous if you deploy many times an hour or run many variants. Set a cardinality ceiling; consider dropping the label for long-term storage.

Kubernetes pod/instance labels

# pod names in k8s are dynamic (pod-abc-xyz123)
up{pod="my-app-abc-def12"}

This is mostly OK because pods have bounded lifetime, but if you never GC old series, it accumulates. Ensure Prometheus retention drops them.

Histogram buckets

Histograms with too many buckets increase cardinality. Typical Prometheus default: 10-15 buckets. Custom histograms sometimes ship with 50+; that is 50x the series count.

Service mesh / Envoy metrics

Istio, Linkerd, Envoy emit rich per-request metrics that can explode. Review their default metric config; filter to keep only what you need.

WAR STORY

A team enabled Istio mesh monitoring and their Prometheus server RAM doubled overnight. Istio's default metric set has dozens of labels (request method, response code, source service, source version, destination service, destination version, destination subset, connection security, etc.) and multiplies enormously in a cluster with many services. The fix was metric_relabel_configs to keep only the 5-6 labels they actually queried. Prometheus RAM dropped back to normal. Before deploying a mesh or any high-cardinality system, review and cap its metric output.


Summary Rules

  1. Never put user IDs, session IDs, trace IDs, email addresses, or timestamps in metric labels.
  2. Route patterns, not raw URLs. /api/users/:id, not /api/users/12345.
  3. Error classes, not error messages. Bounded enum.
  4. Histograms for distributions. Fixed bucket count.
  5. Monitor cardinality. Top-N-by-metric query on a regular dashboard.
  6. Cardinality ceilings as alerts. Catch accidents within minutes.
  7. Different tools for different questions. Metrics for aggregates, logs/traces/events for per-request.
  8. Review dependency-provided metrics. Istio, OpenTelemetry auto-instrumentation can add labels you did not plan for.

Key Concepts Summary

  • Cardinality = number of unique label combinations. Each combination is a time series.
  • Cardinality is the product of distinct values per label. Adding one high-cardinality label multiplies total series.
  • Prometheus (and most metrics backends) cost scales with series count. Memory, storage, and query time all suffer.
  • High-cardinality labels to avoid: user_id, trace_id, session_id, email, raw URL, timestamps, error messages.
  • Low-cardinality labels to prefer: route pattern, method, status class, region, version, error class.
  • Metrics answer aggregates; logs/traces answer per-request. Do not try to make metrics do both.
  • Monitor cardinality routinely. topk(10, count by (__name__) ({__name__=~".+"})) is your friend.
  • Set cardinality ceilings as alerts. The cheapest way to catch accidental explosions.
  • Wide-event tools (Honeycomb, Datadog APM) handle high cardinality natively for investigation use cases.
  • Histogram, not per-value label. For latency, size, or any distribution.

Common Mistakes

  • Adding user_id (or any per-entity ID) to metric labels. Near-guaranteed incident.
  • Putting full URLs with path parameters into labels. Use framework route pattern.
  • Free-text error messages as labels. Use a bounded error-class enum.
  • Enabling Istio/Envoy default metrics without reviewing. Default output is rich; trim to what you query.
  • No cardinality alert. Accidents snowball for days before anyone notices the Prometheus RAM climbing.
  • Treating Grafana Cloud/Datadog metrics as unlimited. They are not — high cardinality is billable.
  • Thinking "small team, small scale, cardinality does not matter yet." It compounds; start with discipline.
  • Confusing per-pod labels (bounded by pod count) with per-user labels (unbounded by user count).
  • Using percentile labels (p="99") instead of histogram buckets. Histograms are designed for this; percentile-as-label is not.
  • Removing a high-cardinality label but forgetting to drop old series or clear retention — you may still hit OOM from old data.

KNOWLEDGE CHECK

Your Prometheus RAM has grown from 8 GB to 60 GB over a month. You check top metrics by cardinality and see `api_request_duration_seconds` has 4.2 million time series. One recent change: a dev added a `user_id` label to help debug a support ticket. What is the immediate remediation and the long-term fix?