Observability Fundamentals for Engineers

PromQL Fundamentals

PromQL is the query language for Prometheus. Most engineers learn three or four functions and stop there. That is usually enough for dashboards, but not enough for debugging incidents — where you need to slice, group, join, and compare across labels in ways that are not obvious the first time you see them.

This lesson covers the PromQL you will actually use: rate(), increase(), sum by, histogram_quantile(), and the join/offset/compare patterns engineers reach for when something is on fire.

KEY CONCEPT

PromQL works on vectors, not rows. Every query returns a set of labeled time series. Thinking in terms of vectors — not SQL-style rows — is the single biggest shift required to become fluent.


Instant vectors vs range vectors

PromQL has two fundamental data types:

  • Instant vector — the current value of every matching series at a single point in time. Example: http_requests_total.
  • Range vector — a window of samples over time. Example: http_requests_total[5m] (every sample in the last 5 minutes).

Most functions require one or the other. rate() takes a range vector. sum() takes an instant vector. Mixing them is the most common PromQL error.

# Instant vector — current value of every series
http_requests_total

# Range vector — last 5 minutes of samples
http_requests_total[5m]

# rate() turns a range vector back into an instant vector
rate(http_requests_total[5m])

The shape of most queries: start with a range vector, apply a function that collapses it to an instant vector, then aggregate.


rate() — the most important function

rate(counter[window]) returns the per-second average rate of increase of a counter over the window. It is the function you will use more than any other.

rate(http_requests_total[5m])

This says: for each time series of http_requests_total, compute the average requests per second over the last 5 minutes.

Why not just look at the counter? Counters only ever go up. The absolute value is meaningless — what matters is how fast it is increasing. rate() answers that.

PRO TIP

Always pick a window at least 4× your scrape interval. If you scrape every 15s, use [1m] minimum; [5m] is a safer default. Shorter windows are noisier; longer windows smooth out spikes but lag during incidents.

rate() also handles counter resets automatically. When a process restarts and the counter drops to zero, rate() treats it as a continuation, not a huge negative spike.
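The reset handling is simple to sketch. Here is a toy Python model of the idea (a simplification: the real rate() also extrapolates to the window boundaries, which this ignores):

```python
def counter_rate(samples):
    """Per-second rate of a counter, handling resets the way rate() does.

    samples: list of (timestamp_seconds, value), oldest first.
    A drop in value is treated as a restart from zero, so the new
    value itself is the increase since the reset.
    """
    total_increase = 0.0
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        # monotonic step: normal increase; drop: counter restarted from 0
        total_increase += (v1 - v0) if v1 >= v0 else v1
    return total_increase / (samples[-1][0] - samples[0][0])

# Counter resets at t=30 (process restart), yet the rate stays sane:
samples = [(0, 100), (15, 130), (30, 10), (45, 40)]
counter_rate(samples)  # increases 30 + 10 + 30 = 70 over 45s ≈ 1.56/s
```

Without the reset logic, the t=30 sample would contribute an increase of -120 and the whole window would go negative.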


rate() vs irate() vs increase()

Three closely related functions, three different use cases:

  • rate(): average over the whole window. Smooth and stable. Use for dashboards and alerts.
  • irate(): last two samples only. Spiky and responsive. Live debugging only.
  • increase(): total increase over the window (equals rate × window seconds). Human-readable totals.

When to use which:

  • rate(errors_total[5m]) > 0.1 to alert on error rate: stable, averaged, will not flap.
  • irate(cpu_seconds_total[1m]) to debug a live spike: you see the last sample, not the average.
  • increase(http_requests_total[1h]) for human-friendly totals: "we got 42,000 requests in the last hour."

Rule of thumb: use rate() for everything except the specific cases where irate() or increase() clearly fit better.

WARNING

Never use irate() in alerts. It samples only the last two points, so it flickers wildly and causes alert storms. Always use rate() with a window of 5 minutes or more for alerting.


Aggregation — sum, avg, max, min

Raw rate() gives you one time series per label combination. That is almost never what you want on a dashboard. You want to aggregate.

# One series per (method, status, instance, pod) — dozens of lines
rate(http_requests_total[5m])

# Total request rate across the entire service — one line
sum(rate(http_requests_total[5m]))

# Rate by status code — one line per status code
sum by (status) (rate(http_requests_total[5m]))

# Rate excluding specific labels (collapse pod/instance but keep method/status)
sum without (pod, instance) (rate(http_requests_total[5m]))

sum by (...) keeps the listed labels and drops everything else. sum without (...) drops the listed labels and keeps everything else. Pick whichever is shorter to write.
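To make the vector model concrete, here is a toy Python model of sum by (an illustrative sketch, not a real client API): an instant vector is just a mapping from label sets to values.

```python
def sum_by(keys, vector):
    """Aggregate an instant vector, keeping only the labels in `keys`.

    vector: dict mapping a tuple of (label, value) pairs to a sample value.
    """
    out = {}
    for labels, value in vector.items():
        # project each series down to the kept labels, then sum collisions
        kept = tuple(sorted((k, v) for k, v in labels if k in keys))
        out[kept] = out.get(kept, 0.0) + value
    return out

vector = {
    (("method", "GET"), ("pod", "a"), ("status", "200")): 5.0,
    (("method", "GET"), ("pod", "b"), ("status", "200")): 3.0,
    (("method", "POST"), ("pod", "a"), ("status", "500")): 1.0,
}
sum_by({"status"}, vector)
# {(("status", "200"),): 8.0, (("status", "500"),): 1.0}
```

sum without (...) is the same operation with the key set inverted: keep every label except the listed ones.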

KEY CONCEPT

The standard PromQL pattern is aggregation( rate( counter[window] ) ). Inside-out: take a counter, compute a rate, sum or avg across the labels you do not care about. Almost every dashboard query follows this shape.


Filtering with label matchers

Every PromQL query can filter on labels. The syntax lives inside {...}:

# Exact match
http_requests_total{status="500"}

# Regex match
http_requests_total{status=~"5.."}

# Negative match
http_requests_total{method!="GET"}

# Negative regex
http_requests_total{path!~"/health|/metrics"}

# Combine multiple
http_requests_total{service="api", status=~"5..", method="POST"}

Filters happen before aggregation. This pattern — filter then aggregate — is how you zero in on a specific subset:

# Error rate for the checkout service, by HTTP method
sum by (method) (
  rate(http_requests_total{service="checkout", status=~"5.."}[5m])
)

histogram_quantile — computing p95 and p99

Histograms store request durations as buckets. To turn buckets into a percentile, use histogram_quantile():

histogram_quantile(0.99,
  rate(http_request_duration_seconds_bucket[5m])
)

Three things happen in this query:

  1. http_request_duration_seconds_bucket is the bucket counter series (one per le boundary).
  2. rate(...[5m]) gives the per-second rate at each bucket.
  3. histogram_quantile(0.99, ...) interpolates across buckets to estimate the 99th percentile.

WARNING

Histogram quantiles are estimates, not exact values. The accuracy depends on your bucket boundaries. If your buckets are 0.1, 0.5, 1, 5 seconds and the true p99 is 0.7s, the quantile function can only say somewhere between 0.5 and 1. Design your buckets around the latencies you actually care about.
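The interpolation itself is easy to model. A simplified Python version of the core calculation (Prometheus's real implementation adds edge-case handling this sketch omits):

```python
import math

def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative histogram buckets.

    buckets: sorted list of (le, cumulative_count); the last le is +Inf.
    Linearly interpolates inside the bucket containing the target rank.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                return prev_le  # no upper bound to interpolate toward
            # assume observations are spread evenly within the bucket
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count

# Buckets 0.1/0.5/1/5s, 100 observations. The p95 lands inside the
# (0.5, 1] bucket and is interpolated — an estimate, not a measurement:
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (5.0, 100), (math.inf, 100)]
histogram_quantile(0.95, buckets)  # ≈ 0.78
```

The "spread evenly" assumption is exactly why coarse buckets give coarse estimates: every true distribution inside the bucket produces the same answer.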

Aggregating histograms across labels

This is where engineers most often get it wrong. You cannot sum quantiles:

# WRONG — this does not give you the overall p99
avg(histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])))

You have to aggregate the buckets first, then compute the quantile:

# CORRECT — sum buckets across all pods, then compute p99
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

The by (le) is critical. le is the bucket boundary label — you must preserve it through the aggregation or histogram_quantile cannot compute anything.

# p99 by HTTP method — keep le AND method, drop everything else
histogram_quantile(0.99,
  sum by (le, method) (rate(http_request_duration_seconds_bucket[5m]))
)

Binary operations between series

You can do math across series. Prometheus matches them by label:

# Error ratio — errors divided by total requests
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))

If both sides aggregate down to a single series, the result is a single series. If both sides have labels, Prometheus pairs them up by matching label sets.

on / ignoring — controlling the join

When the label sets do not match exactly, you need to tell Prometheus how to join:

# Error ratio per service — join on 'service' only, ignore other labels
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
  /
sum by (service) (rate(http_requests_total[5m]))

When the match is many-to-one (multiple series on one side pair with a single series on the other), use group_left or group_right:

# CPU usage per pod, joined with pod metadata labels (many-to-one)
rate(container_cpu_usage_seconds_total[5m])
  * on (pod) group_left(namespace, team)
  kube_pod_labels

PRO TIP

group_left means the left side has many series and the right side has one — keep the left side cardinality. This is how you enrich metrics with labels from kube_pod_labels, kube_node_labels, etc.
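The many-to-one semantics can be sketched in the same toy vector model (a hypothetical helper, not a real API): index the "one" side by the join key, then enrich each left-hand series with the labels you asked group_left to copy.

```python
def join_group_left(on_keys, copy_keys, left, right, op):
    """Many-to-one join: every left series matches at most one right series
    on `on_keys`; labels in `copy_keys` are copied over from the right side.

    left, right: dicts mapping tuples of (label, value) pairs to values.
    """
    index = {}
    for labels, value in right.items():
        key = tuple(sorted((k, v) for k, v in labels if k in on_keys))
        index[key] = (labels, value)  # must be unique, or the join is many-to-many
    out = {}
    for labels, value in left.items():
        key = tuple(sorted((k, v) for k, v in labels if k in on_keys))
        if key in index:  # unmatched left series drop out, as in PromQL
            rlabels, rvalue = index[key]
            extra = tuple(p for p in rlabels if p[0] in copy_keys)
            out[labels + extra] = op(value, rvalue)
    return out

cpu = {
    (("pod", "a"), ("cpu", "0")): 0.5,
    (("pod", "a"), ("cpu", "1")): 0.3,
    (("pod", "b"), ("cpu", "0")): 0.2,
}
pod_info = {  # kube_pod_labels-style info metric; the value is always 1
    (("pod", "a"), ("team", "payments")): 1.0,
    (("pod", "b"), ("team", "search")): 1.0,
}
join_group_left({"pod"}, {"team"}, cpu, pod_info, lambda l, r: l * r)
# three series survive, each enriched with its pod's team label
```

Multiplying by an info metric whose value is 1 leaves the left-hand values untouched, which is why `* on (...) group_left(...)` is the standard label-enrichment trick.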


Offset and comparison — this vs last week

offset shifts a query backwards in time. Pair it with a comparison operator to ask "is today different from last week?"

# Current error rate vs same time last week
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total{status=~"5.."}[5m] offset 1w))

If the ratio is greater than 1, you are seeing more errors than last week.

# Deployment traffic change — is the new deploy seeing less traffic?
sum(rate(http_requests_total[5m]))
  -
sum(rate(http_requests_total[5m] offset 10m))

The patterns you will use in incidents

Nine queries that cover most PromQL usage during an on-call page:

1. Error rate

sum(rate(http_requests_total{status=~"5.."}[5m]))

2. Error ratio

sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))

3. p99 latency

histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

4. RPS by endpoint

sum by (path) (rate(http_requests_total[5m]))

5. Top 10 slow endpoints

topk(10,
  histogram_quantile(0.99,
    sum by (le, path) (rate(http_request_duration_seconds_bucket[5m]))
  )
)

6. CPU saturation

sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))
  /
sum by (pod) (kube_pod_container_resource_limits{resource="cpu"})

7. Memory saturation

sum by (pod) (container_memory_working_set_bytes)
  /
sum by (pod) (kube_pod_container_resource_limits{resource="memory"})

8. Restart rate

sum by (pod) (rate(kube_pod_container_status_restarts_total[15m]))

9. Pod readiness

sum by (namespace) (kube_pod_status_ready{condition="true"})

PRO TIP

Save these as Grafana dashboard variables or as named recording rules in Prometheus. In a real incident you do not want to be typing histogram_quantile from scratch while the site is down.


Recording rules — precomputing expensive queries

Some queries are too expensive to run live. A histogram_quantile across 1000 pods with 20 bucket boundaries is a lot of arithmetic on every dashboard refresh. Recording rules compute them ahead of time:

groups:
  - name: http_latency
    interval: 30s
    rules:
      - record: job:http_request_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum by (le, job) (rate(http_request_duration_seconds_bucket[5m]))
          )

After this is loaded, your dashboard query becomes:

job:http_request_duration_seconds:p99

Same result, computed once every 30 seconds instead of on every dashboard refresh for every viewer.

KEY CONCEPT

The naming convention matters: level:metric:operations. job:http_request_duration_seconds:p99 reads as aggregated to the job level, from http_request_duration_seconds, computing p99. Nothing enforces the convention, but everyone reading your dashboards and alert rules depends on it staying consistent.


PromQL gotchas engineers hit

Gotcha 1: rate() on a gauge

rate() is only for counters. On a gauge it produces nonsense because gauges can go down — and rate() interprets down as a counter reset.

# WRONG — memory is a gauge
rate(container_memory_working_set_bytes[5m])

# RIGHT — use delta() or deriv() for gauges
deriv(container_memory_working_set_bytes[5m])

Gotcha 2: label mismatch breaks joins silently

# This returns EMPTY if the left and right have different label sets
A / B

No error, just no data. Always use sum by (...) on both sides to force them to the same label set, or use on/ignoring explicitly.

Gotcha 3: matching across metric names is expensive

{__name__=~"..."} forces a full scan of the TSDB. Fine for ad-hoc exploration; do not put it in dashboards or alerts.

Gotcha 4: offset does not shift the range

# This is the last 5 minutes, as of 1 week ago
rate(http_requests_total[5m] offset 1w)

# NOT a 1-week window from last week — there is no such thing

Quiz

KNOWLEDGE CHECK

You want to alert when the 99th percentile latency across your fleet of 50 API pods exceeds 500ms. Which PromQL query is correct?


What to take away

  • PromQL works on vectors; every query returns a set of labeled time series.
  • rate(counter[5m]) is the single most important function. Use it for nearly everything.
  • Aggregate with sum by (...) or sum without (...). Pick the shorter one.
  • Histogram quantiles: always sum buckets first, compute quantile second, preserve the le label.
  • Use offset to compare now vs a week ago; use recording rules to precompute expensive queries.

Next lesson: how to write good metrics from the start so your PromQL queries stay sane as the system grows.