Writing Good Metrics
Most teams discover their metrics system is broken only when the bill lands or the query times out. The root cause is almost always the same: someone added a label they should not have, or named a metric in a way that duplicates an existing one, or instrumented something that should have been a log.
This lesson is about writing metrics that scale — naming, label design, what not to measure, and how to enforce cardinality budgets before they become a problem.
Metrics are a design decision, not an afterthought. The name and labels you pick today will shape every dashboard, alert, and PromQL query for years. Spend 10 minutes thinking about it up front.
The Prometheus naming convention
Every metric name follows a predictable pattern:
<namespace>_<subsystem>_<measurement>_<unit>_<suffix>
Examples from real exporters:
http_requests_total
http_request_duration_seconds
process_cpu_seconds_total
node_filesystem_avail_bytes
kube_pod_container_resource_limits
The parts:
- Namespace — the system the metric comes from: http, process, node, kube, grpc.
- Subsystem — optional; narrows the scope: request, container, filesystem.
- Measurement — what is being measured: duration, requests, avail.
- Unit — base units: seconds, bytes, meters. Never milliseconds, never kilobytes.
- Suffix — _total for counters, _bucket/_count/_sum for histograms (added automatically).
Always use base units. request_duration_seconds, not request_duration_ms. memory_bytes, not memory_mb. This is a hard rule in the Prometheus ecosystem — every library, every dashboard, every recording rule assumes base units. Breaking the convention means your metrics will not compose with anyone else's.
The five naming rules
Rule 1: counters end in _total
http_requests_total OK
http_requests WRONG — looks like a gauge
The _total suffix is a signal to readers (and to tooling) that this is a monotonically increasing counter. Prometheus client libraries add it automatically if you define a Counter.
Rule 2: use the unit in the name
http_request_duration_seconds OK
http_request_duration WRONG — what unit?
http_request_duration_ms WRONG — not base unit
Rule 3: use singular nouns for things, plural for counts
http_requests_total plural — counting events
http_request_duration_seconds singular — measuring one thing per request
node_filesystem_avail_bytes singular — a property of the filesystem
Rule 4: name the metric after what it measures, not where it comes from
http_requests_total OK — measures HTTP requests
api_gateway_http_requests_total OK — namespaced by the system
requests_from_the_auth_middleware WRONG — describes source, not measurement
Rule 5: do not encode labels in the metric name
# WRONG — separate metric per status code
http_requests_200_total
http_requests_404_total
http_requests_500_total
# RIGHT — one metric, status as a label
http_requests_total{status="200"}
http_requests_total{status="404"}
http_requests_total{status="500"}
This is the single most common mistake. If two metric names differ only by a value that could be a label, they should be one metric with a label.
Label design — the most important decision
Labels are what make metrics queryable. They are also what blow up your Prometheus bill. Every unique label value combination creates a new time series.
The test for whether a label is safe: can you bound its values ahead of time? If the answer is no — the values come from users, URLs, payloads, or IDs — it is not a label. It is a log field.
The cardinality budget
Every team that runs Prometheus at scale ends up with a cardinality budget — an informal or formal limit on how many series each service is allowed to contribute.
A reasonable starting budget for a service:
Per-instance series: 5,000
Per-service total series: 50,000
Per-cluster total series: 2,000,000
When a service exceeds its budget, it gets a ticket, not a production incident. The budget is the trigger for review, not a hard block.
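The review trigger described above can be automated with a small check. This is a minimal sketch under stated assumptions: overBudget is a hypothetical helper, and the per-service series counts would come from a Prometheus query such as count by (service) ({__name__=~".+"}) fetched out of band.

```go
package main

import "fmt"

// overBudget returns the services whose active series count exceeds the
// per-service budget. Hypothetical helper — feed it series counts
// obtained from Prometheus, and file a ticket for each offender.
func overBudget(seriesByService map[string]int, budget int) []string {
	var offenders []string
	for svc, n := range seriesByService {
		if n > budget {
			offenders = append(offenders, svc)
		}
	}
	return offenders
}

func main() {
	counts := map[string]int{"checkout": 72000, "auth": 8000}
	// With the 50,000 per-service budget above, only checkout is flagged.
	fmt.Println(overBudget(counts, 50000))
}
```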
How to estimate cardinality
metric_cardinality = product of (cardinality of each label)
# Example:
http_requests_total{method, status, route, service}
#                   method=4, status=10, route=50, service=20
cardinality = 4 * 10 * 50 * 20 = 40,000 series
Now add one more label — user_id with a million users — and the cardinality becomes:
4 * 10 * 50 * 20 * 1,000,000 = 40,000,000,000 series
That is a pager at 2am.
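The estimate above is just a product of per-label cardinalities, which is easy to check in code. seriesEstimate is a hypothetical helper, not part of any Prometheus library:

```go
package main

import "fmt"

// seriesEstimate multiplies per-label cardinalities to give a worst-case
// series count for one metric, as in the example above.
func seriesEstimate(labelCardinalities ...int) int {
	total := 1
	for _, c := range labelCardinalities {
		total *= c
	}
	return total
}

func main() {
	// method=4, status=10, route=50, service=20
	fmt.Println(seriesEstimate(4, 10, 50, 20)) // 40000

	// Add user_id with a million values and the estimate explodes.
	fmt.Println(seriesEstimate(4, 10, 50, 20, 1_000_000)) // 40000000000
}
```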
Use topk(20, count by (__name__) ({__name__=~".+"})) to find your top 20 highest-cardinality metrics. Do this weekly. The worst offenders are usually obvious once you look.
What to measure — the RED method
For any request-driven service, instrument the RED signals:
- Rate — requests per second
- Errors — errors per second (or error ratio)
- Duration — distribution of request durations
Three metrics cover 80% of what you need:
http_requests_total{service, route, method, status}
http_request_duration_seconds{service, route, method} # histogram
http_requests_in_flight{service} # gauge
For a resource (queue, connection pool, worker), use the USE method:
- Utilization — percent of resource in use
- Saturation — amount of work queued / waiting
- Errors — errors from the resource
What NOT to measure as metrics
Not everything should be a metric. Some data belongs in logs or traces.
Heuristic: if you need to answer a question about a specific event (request, user, transaction), use logs or traces. If you need to answer a question about aggregate behavior (rates, distributions, percentages), use metrics.
Do not use metrics for
- Per-request IDs — request IDs are cardinality bombs. Log them instead.
- Free-text error messages — unbounded cardinality. Use a small number of bucketed error_type labels.
- Timestamps of specific events — use logs for events.
- Full URLs with query strings — normalize to the route template before labeling.
- Customer / user identities — almost always belong in logs or traces for privacy and cardinality reasons.
- Payload contents — never put request or response bodies in metrics.
Normalizing route labels
A common cardinality trap: using the raw request path as a label.
# BAD — each user ID becomes a new series
http_requests_total{path="/users/42/orders"}
http_requests_total{path="/users/43/orders"}
http_requests_total{path="/users/44/orders"}
# ... millions of series
# GOOD — normalize to the route template
http_requests_total{route="/users/:id/orders"}
Every HTTP framework provides access to the matched route template — use it. If your framework doesn't, normalize with regex before labeling:
path = regexp.MustCompile(`/users/\d+/`).ReplaceAllString(path, "/users/:id/")
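For more than one route, a pattern table is easier to maintain than chained regex replacements. This is a sketch — routePatterns and normalizeRoute are hypothetical names, and the table would be extended with your own routes. Unknown paths fall through to "other", which doubles as a cardinality cap:

```go
package main

import (
	"fmt"
	"regexp"
)

// routePatterns maps raw-path regexes to route templates; first match wins.
// Compile the regexes once at startup, not per request.
var routePatterns = []struct {
	re       *regexp.Regexp
	template string
}{
	{regexp.MustCompile(`^/users/\d+/orders$`), "/users/:id/orders"},
	{regexp.MustCompile(`^/users/\d+$`), "/users/:id"},
}

func normalizeRoute(path string) string {
	for _, p := range routePatterns {
		if p.re.MatchString(path) {
			return p.template
		}
	}
	// Unknown paths collapse to one bucket so they cannot explode cardinality.
	return "other"
}

func main() {
	fmt.Println(normalizeRoute("/users/42/orders")) // /users/:id/orders
	fmt.Println(normalizeRoute("/weird/path"))      // other
}
```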
Instrumenting a service — a concrete example
Here is a well-designed set of metrics for a typical HTTP API service:
// Counter — total requests
httpRequestsTotal := prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
Help: "Total HTTP requests processed",
},
[]string{"method", "route", "status"},
)
// Histogram — request duration distribution
httpRequestDuration := prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "HTTP request latency distribution",
Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
},
[]string{"method", "route"},
)
// Gauge — concurrent in-flight requests
httpInflight := prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Name: "http_requests_in_flight",
Help: "Current number of in-flight HTTP requests",
},
[]string{"method"},
)
Three metrics, three bounded labels (method, route, status), on the order of a few thousand series per instance in the worst case. Covers RED fully.
How to bound cardinality defensively
Even with good intentions, labels creep. Defensive techniques:
1. Whitelist label values
If a label like status occasionally gets an unexpected value (say, "unknown" or a raw exception type), bucket it explicitly:
func normalizeStatus(code int) string {
switch {
case code < 200: return "1xx"
case code < 300: return "2xx"
case code < 400: return "3xx"
case code < 500: return "4xx"
default: return "5xx"
}
}
2. Cap high-cardinality labels
For labels that genuinely have a wide range (e.g. error_type), cap the cardinality with an other bucket:
knownErrors := map[string]bool{
"timeout": true, "refused": true, "conn_reset": true,
"dns": true, "tls": true, "auth": true,
}
func normalizeError(err string) string {
if knownErrors[err] { return err }
return "other"
}
3. Recording rules to drop cardinality
If a metric is already too wide, create a recording rule that aggregates away the offending label:
- record: service:http_requests:rate5m
  expr: sum without (pod, instance) (rate(http_requests_total[5m]))
Then point your dashboards at the lower-cardinality version.
War story — the tenant label
A team I worked with added a tenant label to their http_requests_total metric. Seemed fine — they had 50 tenants. Two years later they had 5,000 tenants and 400 metrics with tenant as a label. The result:
- 2 million active series
- Prometheus queries timing out
- Grafana dashboards taking 30 seconds to load
- Bill from their managed Prometheus vendor jumped 8×
The fix was to drop tenant from all non-critical metrics and add it back only to a single counter (tenant_requests_total) used for billing dashboards. Cardinality dropped from 2M to 40K. Queries went from 30s to 200ms.
Cardinality problems are exponential. A label that seems fine today can explode next year as the business grows. Err on the side of fewer, bounded labels.
Quiz
You are instrumenting a checkout service. Which of these is the best metric design for measuring request latency?
What to take away
- Follow the naming convention: namespace_subsystem_measurement_unit_suffix. Always use base units.
- Every label multiplies cardinality. Use bounded, enumerable values only.
- Never use user IDs, request IDs, raw URLs, emails, or free text as labels.
- Instrument the RED signals for services and USE signals for resources.
- Set a cardinality budget and check it weekly with topk() queries.
- When in doubt, fewer labels are better than more.
Next module: logs — what they are for, how to structure them, and when to reach for them instead of metrics.