Observability Fundamentals for Engineers

Writing Good Metrics

Most teams discover their metrics system is broken only when the bill lands or the query times out. The root cause is almost always the same: someone added a label they should not have, or named a metric in a way that duplicates an existing one, or instrumented something that should have been a log.

This lesson is about writing metrics that scale — naming, label design, what not to measure, and how to enforce cardinality budgets before they become a problem.

KEY CONCEPT

Metrics are a design decision, not an afterthought. The name and labels you pick today will shape every dashboard, alert, and PromQL query for years. Spend 10 minutes thinking about it up front.


The Prometheus naming convention

Every metric name follows a predictable pattern:

<namespace>_<subsystem>_<measurement>_<unit>_<suffix>

Examples from real exporters:

http_requests_total
http_request_duration_seconds
process_cpu_seconds_total
node_filesystem_avail_bytes
kube_pod_container_resource_limits

The parts:

  1. Namespace — the system the metric comes from: http, process, node, kube, grpc.
  2. Subsystem — optional, narrows the scope: request, container, filesystem.
  3. Measurement — what is being measured: duration, requests, avail.
  4. Unit — base units: seconds, bytes, meters. Never milliseconds, never kilobytes.
  5. Suffix — _total for counters; _bucket/_count/_sum for histograms (the histogram suffixes are added automatically by the client library).

WARNING

Always use base units. request_duration_seconds, not request_duration_ms. memory_bytes, not memory_mb. This is a hard rule in the Prometheus ecosystem — every library, every dashboard, every recording rule assumes base units. Breaking the convention means your metrics will not compose with anyone else's.


The five naming rules

Rule 1: counters end in _total

http_requests_total       OK
http_requests             WRONG — looks like a gauge

The _total suffix is a signal to readers (and to tooling) that this is a monotonically increasing counter. Some client libraries (for example the Python client) append it automatically when exposing a Counter; others (for example the Go client) expect it in the name you define — check yours before typing it by hand.

Rule 2: use the unit in the name

http_request_duration_seconds    OK
http_request_duration            WRONG — what unit?
http_request_duration_ms         WRONG — not base unit

Rule 3: use singular nouns for things, plural for counts

http_requests_total              plural — counting events
http_request_duration_seconds    singular — measuring one thing per request
node_filesystem_avail_bytes      singular — a property of the filesystem

Rule 4: name the metric after what it measures, not where it comes from

http_requests_total                OK — measures HTTP requests
api_gateway_http_requests_total    OK — namespaced by the system
requests_from_the_auth_middleware  WRONG — describes source, not measurement

Rule 5: do not encode labels in the metric name

# WRONG — separate metric per status code
http_requests_200_total
http_requests_404_total
http_requests_500_total

# RIGHT — one metric, status as a label
http_requests_total{status="200"}
http_requests_total{status="404"}
http_requests_total{status="500"}

This is the single most common mistake. If two metric names differ only by a value that could be a label, they should be one metric with a label.
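
The payoff shows up immediately in PromQL. With status as a label, the standard aggregations just work; with per-status metric names, every query needs to enumerate metric names by hand. For example (label names as above):

```
# Total request rate across all statuses — one expression
sum(rate(http_requests_total[5m]))

# Rate broken out by status — still one expression
sum by (status) (rate(http_requests_total[5m]))
```
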


Label design — the most important decision

Labels are what make metrics queryable. They are also what blow up your Prometheus bill. Every unique label value combination creates a new time series.

GOOD LABELS — bounded set of known values

  method         ~8 values: GET/POST/PUT/...
  status         ~10 values: 200/301/400/...
  route          route template, ~50 values
  service        service name, ~20 values
  environment    prod/staging/dev — 3 values
  region         us-east-1/eu-west-1 — ~5 values

  Cardinality is predictable and capped.

BAD LABELS — unbounded or user-controlled values

  user_id        millions of unique IDs
  request_id     unique per request
  full_url       unbounded with query params
  email          per-user cardinality
  error_message  free text = infinite unique values
  timestamp      new series every second

  These belong in logs or traces, not labels.
WARNING

The test for whether a label is safe: can you bound its values ahead of time? If the answer is no — the values come from users, URLs, payloads, or IDs — it is not a label. It is a log field.


The cardinality budget

Every team that runs Prometheus at scale ends up with a cardinality budget — an informal or formal limit on how many series each service is allowed to contribute.

A reasonable starting budget for a service:

Per-instance series:        5,000
Per-service total series:  50,000
Per-cluster total series: 2,000,000

When a service exceeds its budget, it gets a ticket, not a production incident. The budget is the trigger for review, not a hard block.

How to estimate cardinality

metric_cardinality ≈ product of (cardinality of each label)

This is the worst case — only label combinations that actually occur create series — but the product is what you should budget for.

# Example:
http_requests_total{method, status, route, service}
                    4       10      50     20

cardinality = 4 * 10 * 50 * 20 = 40,000 series

Now add one more label — user_id with a million users — and the cardinality becomes:

4 * 10 * 50 * 20 * 1,000,000 = 40,000,000,000 series

That is a pager at 2am.
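
The multiplication above is mechanical enough to script. A minimal sketch, using the label counts from the example (worst-case estimate — real series counts are the combinations that actually occur):

```go
package main

import "fmt"

// seriesEstimate returns the worst-case series count for one metric:
// the product of each label's value-set size.
func seriesEstimate(labelCardinalities map[string]int) int {
	total := 1
	for _, n := range labelCardinalities {
		total *= n
	}
	return total
}

func main() {
	labels := map[string]int{
		"method":  4,
		"status":  10,
		"route":   50,
		"service": 20,
	}
	fmt.Println(seriesEstimate(labels)) // 40000

	// Add the cardinality bomb and watch the estimate explode.
	labels["user_id"] = 1_000_000
	fmt.Println(seriesEstimate(labels))
}
```

Running a check like this in code review — before the label ships — is far cheaper than finding out from the topk() query later.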

PRO TIP

Use topk(20, count by (__name__) ({__name__=~".+"})) to find your top 20 highest-cardinality metrics. Do this weekly. The worst offenders are usually obvious once you look.


What to measure — the RED method

For any request-driven service, instrument the RED signals:

  • Rate — requests per second
  • Errors — errors per second (or error ratio)
  • Duration — distribution of request durations

Three metrics cover 80% of what you need:

http_requests_total{service, route, method, status}
http_request_duration_seconds{service, route, method}  # histogram
http_requests_in_flight{service}  # gauge
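
These three metrics answer the RED questions directly. Assuming the label sets above (and raw status codes like "200" and "503" as label values), the standard queries look like:

```
# Rate — requests per second, per route
sum by (route) (rate(http_requests_total[5m]))

# Errors — 5xx ratio
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Duration — p99 latency from the histogram
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```
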

For a resource (queue, connection pool, worker), use the USE method:

  • Utilization — percent of resource in use
  • Saturation — amount of work queued / waiting
  • Errors — errors from the resource

What NOT to measure as metrics

Not everything should be a metric. Some data belongs in logs or traces.

KEY CONCEPT

Heuristic: if you need to answer a question about a specific event (request, user, transaction), use logs or traces. If you need to answer a question about aggregate behavior (rates, distributions, percentages), use metrics.

Do not use metrics for

  • Per-request IDs — request IDs are cardinality bombs. Log them instead.
  • Free-text error messages — unbounded cardinality. Use a small number of bucketed error_type labels.
  • Timestamps of specific events — use logs for events.
  • Full URLs with query strings — normalize to the route template before labeling.
  • Customer / user identities — almost always belong in logs or traces for privacy and cardinality reasons.
  • Payload contents — never put request or response bodies in metrics.

Normalizing route labels

A common cardinality trap: using the raw request path as a label.

# BAD — each user ID becomes a new series
http_requests_total{path="/users/42/orders"}
http_requests_total{path="/users/43/orders"}
http_requests_total{path="/users/44/orders"}
# ... millions of series

# GOOD — normalize to the route template
http_requests_total{route="/users/:id/orders"}

Every HTTP framework provides access to the matched route template — use it. If your framework doesn't, normalize with regex before labeling:

path = regexp.MustCompile(`/users/\d+/`).ReplaceAllString(path, "/users/:id/")
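
Wrapped in a helper, that one-liner generalizes to any numeric path segment. This is a sketch — the pattern covers only purely numeric IDs, and a real service would extend it for UUIDs and the like:

```go
package main

import (
	"fmt"
	"regexp"
)

// numericID matches path segments that are purely numeric.
var numericID = regexp.MustCompile(`/\d+(/|$)`)

// normalizeRoute collapses numeric segments into :id so every user
// hits the same label value instead of minting a new series.
func normalizeRoute(path string) string {
	return numericID.ReplaceAllString(path, "/:id$1")
}

func main() {
	fmt.Println(normalizeRoute("/users/42/orders"))      // /users/:id/orders
	fmt.Println(normalizeRoute("/users/43/orders/9000")) // /users/:id/orders/:id
}
```

Run normalization as close to the instrumentation point as possible (in the metrics middleware, not in each handler), so no raw path can leak into a label.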

Instrumenting a service — a concrete example

Here is a well-designed set of metrics for a typical HTTP API service:

// Counter — total requests
httpRequestsTotal := prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total HTTP requests processed",
    },
    []string{"method", "route", "status"},
)

// Histogram — request duration distribution
httpRequestDuration := prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "HTTP request latency distribution",
        Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
    },
    []string{"method", "route"},
)

// Gauge — concurrent in-flight requests
httpInflight := prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "http_requests_in_flight",
        Help: "Current number of in-flight HTTP requests",
    },
    []string{"method"},
)

Three metrics, three bounded labels (method, route, status), on the order of a few thousand series per instance. Covers RED fully.


How to bound cardinality defensively

Even with good intentions, labels creep. Defensive techniques:

1. Whitelist label values

If a label like status occasionally gets an unexpected value (say, "unknown" or a raw exception type), bucket it explicitly:

func normalizeStatus(code int) string {
    switch {
    case code < 200: return "1xx"
    case code < 300: return "2xx"
    case code < 400: return "3xx"
    case code < 500: return "4xx"
    default: return "5xx"
    }
}

2. Cap high-cardinality labels

For labels that genuinely have a wide range (e.g. error_type), cap the cardinality with an other bucket:

knownErrors := map[string]bool{
    "timeout": true, "refused": true, "conn_reset": true,
    "dns": true, "tls": true, "auth": true,
}

func normalizeError(err string) string {
    if knownErrors[err] { return err }
    return "other"
}

3. Recording rules to drop cardinality

If a metric is already too wide, create a recording rule that aggregates away the offending label:

- record: service:http_requests:rate5m
  expr: sum without (pod, instance) (rate(http_requests_total[5m]))

(Recording rules follow the level:metric:operations naming convention. Note the result is a rate, not a counter, so the name must not keep the _total suffix.)

Then point your dashboards at the lower-cardinality version.


War story — the tenant label

A team I worked with added a tenant label to their http_requests_total metric. Seemed fine — they had 50 tenants. Two years later they had 5,000 tenants and 400 metrics with tenant as a label. The result:

  • 2 million active series
  • Prometheus queries timing out
  • Grafana dashboards taking 30 seconds to load
  • Bill from their managed Prometheus vendor jumped 8×

The fix was to drop tenant from all non-critical metrics and add it back only to a single counter (tenant_requests_total) used for billing dashboards. Cardinality dropped from 2M to 40K. Queries went from 30s to 200ms.

KEY CONCEPT

Cardinality problems are exponential. A label that seems fine today can explode next year as the business grows. Err on the side of fewer, bounded labels.



What to take away

  • Follow the naming convention: namespace_subsystem_measurement_unit_suffix. Always use SI base units.
  • Every label multiplies cardinality. Use bounded, enumerable values only.
  • Never use user IDs, request IDs, raw URLs, emails, or free text as labels.
  • Instrument the RED signals for services and USE signals for resources.
  • Set a cardinality budget and check it weekly with topk() queries.
  • When in doubt, fewer labels are better than more.

Next module: logs — what they are for, how to structure them, and when to reach for them instead of metrics.