Observability Fundamentals for Engineers

SLI: What You Measure

Every engineering team has metrics. Very few have the right metrics for answering the one question that matters: are my users actually having a good experience with my service?

An SLI — Service Level Indicator — is a specific, measurable signal that answers that question. Not "is CPU under 80%," not "are pods running," not "is the queue length OK." Something that, when it moves, directly reflects user experience.

This lesson is about how to pick SLIs that matter, how to express them precisely, and how to avoid the most common trap: measuring what is easy instead of what is right.

KEY CONCEPT

An SLI is a ratio of good events to total events over a time window. That is the entire definition. Everything else is a special case of it.


The anatomy of a good SLI

A good SLI has three properties:

  1. User-centric. If this number moves, a user notices. If it changes and no user cares, it is not an SLI.
  2. Ratio form. Almost all SLIs are expressed as good / total — "fraction of requests that succeeded," "fraction of writes that were acknowledged."
  3. Measurable from the right vantage point. Measure from as close to the user as possible. Client-side if you can get it. The load balancer if you cannot. The service itself only as a last resort.

SLI = good_events / valid_events  # over a time window

Why ratio form? Because ratios are bounded (0 to 1), they compose across services, they are comparable across time regardless of traffic volume, and they translate directly into SLOs.
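To make the composition point concrete, here is a quick sketch (the window counts are invented) showing that summing numerators and denominators gives the true SLI, while averaging per-window percentages does not:

```python
# Sketch: why ratio-form SLIs aggregate cleanly. The per-window counts
# below are made-up illustrative numbers, not real traffic data.
windows = [
    {"good": 9_990, "total": 10_000},   # busy window: 99.90%
    {"good": 95, "total": 100},         # quiet window: 95.00%
]

# Naive averaging of per-window percentages over-weights the quiet window:
naive = sum(w["good"] / w["total"] for w in windows) / len(windows)

# Summing numerators and denominators weights every request equally:
combined = sum(w["good"] for w in windows) / sum(w["total"] for w in windows)

print(f"naive average: {naive:.4f}")   # 0.9745 -- misleading
print(f"true SLI:      {combined:.4f}")  # 0.9985
```

The same summation works across services, shards, and time windows, which is why ratio SLIs roll up without any special handling.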


The four canonical SLI categories

Most services need SLIs in these four categories. Not all of them, not always in this order — but these are the shapes of what you measure:

AVAILABILITY: fraction of requests that succeed

  • Ratio: successful_requests / total_requests
  • 5xx = bad. Usually 4xx = not counted (user error).
  • Target: 99.9% or 99.95% for user-facing.
  • Question: "Did we serve the request at all?"

LATENCY: fraction of requests faster than a threshold

  • Ratio: fast_requests / total_requests
  • Pick a threshold users care about (e.g. 300ms). Better than "average latency."
  • Question: "Was the request fast enough?"

CORRECTNESS: fraction of responses that are right

  • Ratio: correct_responses / total_responses
  • Response had the right content, no silent data loss.
  • Harder to measure — often requires sampling.
  • Question: "Was the answer actually right?"

FRESHNESS: fraction of data newer than a threshold

  • Ratio: fresh_records / total_records
  • Critical for pipelines, caches, search indexes. Usually time-lag based.
  • Question: "Is the data too stale to be useful?"

Most user-facing APIs need availability and latency SLIs. Data pipelines need freshness and correctness SLIs. Search engines need all four. Pick what matters for your product.


Availability SLI — the most common

The canonical availability SLI for an HTTP service:

availability = sum(rate(http_requests_total{status!~"5.."}[5m]))
             /
               sum(rate(http_requests_total[5m]))

Translation: "the fraction of requests that did not return a 5xx."

Key decisions

Which status codes count as bad?

  • 5xx is always bad (your fault).
  • 4xx is usually not counted (user's fault — they sent bad input).
  • 429 (rate limited) is a judgment call — if you are throttling aggressively, users see it as a failure.
  • 499 (client closed) is a trap — sometimes your fault (too slow), sometimes the user (closed the tab). Decide based on context.
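These judgment calls can be captured in a small classification helper. This sketch is one possible policy, not the canonical one; in particular, the 429 and 499 choices below are exactly the judgment calls described above, decided one way for illustration:

```python
# Sketch of one possible status-code policy for an availability SLI.
# The function name and the 429/499 decisions are illustrative.
def classify(status: int) -> str:
    """Return 'good', 'bad', or 'excluded' for the availability ratio."""
    if 500 <= status <= 599:
        return "bad"            # server fault: always counts against us
    if status == 429:
        return "bad"            # this team treats throttling as a failure
    if status == 499:
        return "excluded"       # client closed early: not counted here
    if 400 <= status <= 499:
        return "excluded"       # user error: not a valid event
    return "good"               # 2xx / 3xx

statuses = [200, 201, 404, 429, 499, 500, 503]
valid = [s for s in statuses if classify(s) != "excluded"]
good = [s for s in valid if classify(s) == "good"]
print(f"availability = {len(good)}/{len(valid)} = {len(good)/len(valid):.2f}")
```

Note that the 429 check has to come before the generic 4xx check; ordering is where policies like this quietly go wrong.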

What counts as a valid request?

  • Health checks should be excluded. Always.
  • Automated probes (uptime monitors) — include or exclude? Tends to depend on whether you own them.
  • Internal / privileged traffic — usually excluded; it does not reflect user experience.

# Typical final form — exclude health checks, ignore 4xx
sum(rate(http_requests_total{path!~"/health|/ready|/metrics", status!~"5.."}[5m]))
  /
sum(rate(http_requests_total{path!~"/health|/ready|/metrics"}[5m]))

Latency SLI — threshold, not percentile

The single biggest mistake engineers make with latency SLIs: they measure the p99, not the fraction of requests under a threshold.

Wrong — percentile as SLI

latency_p99 < 300ms

This looks right. It is wrong. Problems:

  • You cannot average it. p99 of p99s is not a p99.
  • It does not express the user experience in SLO form (% of good requests).
  • It moves around as traffic composition changes even when user experience is stable.
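A small synthetic demonstration of the first problem (the latencies and shards are made up): averaging per-shard p99s does not give the true p99, while the fast-request ratio aggregates exactly:

```python
# Sketch: you can't average percentiles, but you can sum ratios.
# Latencies (ms) are synthetic data for two imaginary shards.
def p99(xs):
    """Nearest-rank 99th percentile."""
    xs = sorted(xs)
    return xs[max(0, -(-99 * len(xs) // 100) - 1)]  # ceil(0.99*n)-th item

shard_a = [100] * 99 + [5000]         # one slow request
shard_b = [100] * 90 + [5000] * 10    # many slow requests

avg_of_p99s = (p99(shard_a) + p99(shard_b)) / 2
true_p99 = p99(shard_a + shard_b)
print(avg_of_p99s, true_p99)          # 2550.0 5000 -- not the same thing

# The ratio SLI, by contrast, aggregates exactly:
fast = sum(1 for x in shard_a + shard_b if x < 300)
print(fast / len(shard_a + shard_b))  # 0.945
```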

Right — ratio of fast requests

sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
  /
sum(rate(http_request_duration_seconds_count[5m]))

Translation: "fraction of requests completed in under 300ms."

This SLI:

  • Is a ratio (0 to 1). Composes. Averages. Aggregates.
  • Directly translates to an SLO ("99% of requests complete in under 300ms").
  • Is honest about the bimodal nature of latency: many fast requests, a long tail of slow ones.

KEY CONCEPT

Percentiles describe the distribution. SLIs describe user experience. Use percentiles in dashboards; use ratios in SLIs.

Picking the latency threshold

  • What do users notice? Typical rule: 100ms feels instant, 300ms feels snappy, 1s feels slow, 3s feels broken.
  • What does the competition look like? If your peer services respond in 200ms, 800ms feels slow.
  • What is technically possible? The threshold should be ambitious enough to matter but achievable.

For most user-facing APIs, pick something between 200ms and 500ms. For backend-to-backend, you have more headroom.


Correctness SLI — the hardest one

Correctness asks "was the response actually right?" which is much harder to measure than "did it return a 200."

Examples of correctness SLIs

  • Search: fraction of queries where the top result is clicked by a human (implicit relevance).
  • Payments: fraction of reconciled transactions where the ledger matches.
  • Image processing: fraction of resized images that pass a checksum check against the source.
  • ML inference: fraction of predictions that match a ground-truth label (from sampling).

Often you measure correctness via sampling + an external validation process. You do not need 100% coverage — 1-5% sampled and validated is usually enough to detect regressions.
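A sketch of the sampling approach. Everything here is illustrative: validate() stands in for whatever external check applies (reconciliation job, checksum, ground-truth labels), and the 2% rate is just one reasonable choice:

```python
# Sketch: estimating a correctness SLI from a small validated sample.
# validate() is a placeholder for an external validation process.
import random

random.seed(7)  # deterministic for the example

def validate(response_id: int) -> bool:
    """Placeholder check: pretend ~1% of responses are silently wrong."""
    return response_id % 100 != 0

responses = range(100_000)
sample = [r for r in responses if random.random() < 0.02]  # ~2% sampled
correct = sum(validate(r) for r in sample)
print(f"sampled {len(sample)}, estimated correctness {correct/len(sample):.3f}")
```

The estimate is noisy at low sample rates, but a sustained drop from ~0.99 to ~0.95 is unmissable, which is all the SLI needs to do.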


Freshness SLI — for async and cache

Freshness measures time lag. For a data pipeline:

freshness = fraction of records ingested within N minutes of the source event

For a cache:

freshness = fraction of cache entries that are < N minutes old

For search indexes:

freshness = fraction of new documents searchable within N minutes

Freshness SLIs usually require instrumentation at both ends: timestamp at the source, timestamp at the reader, difference at read time.
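A minimal sketch of the read-side computation, assuming each record carries a producer-stamped timestamp (the records and the 5-minute threshold are invented):

```python
# Sketch: a freshness SLI computed at read time from source timestamps.
from datetime import datetime, timedelta

now = datetime(2024, 1, 1, 12, 0, 0)   # fixed "read time" for the example
threshold = timedelta(minutes=5)

# (record_id, source_event_time) -- timestamp stamped by the producer
records = [
    (1, now - timedelta(minutes=1)),
    (2, now - timedelta(minutes=3)),
    (3, now - timedelta(minutes=12)),  # stale
    (4, now - timedelta(seconds=30)),
]

fresh = sum(1 for _, ts in records if now - ts < threshold)
print(f"freshness = {fresh}/{len(records)} = {fresh/len(records):.2f}")  # 3/4
```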


The most common SLI anti-patterns

Anti-pattern 1: measuring what is easy instead of what matters

# WRONG
cpu_usage < 80%
pod_count = expected_count
disk_free > 10%

None of these reflect user experience. CPU can spike and users never notice. A pod can be down and the remaining ones can serve everything fine.

Anti-pattern 2: measuring from the wrong vantage point

Measuring errors from inside your service misses errors that never reach you (DNS, LB, network). Measure from the client if you can (RUM for web, client instrumentation for mobile). Fall back to the edge (load balancer logs).

Anti-pattern 3: including noise

A lot of teams count their own load balancer health checks as "requests," inflating their availability SLI to 99.99% with traffic no user ever sent. Exclude synthetic traffic from SLIs.

Anti-pattern 4: one SLI per service, regardless of endpoint

A /search endpoint and a /search-suggestions endpoint have different user expectations. Lumping them into one availability SLI means the 10 QPS of /search is washed out by the 10,000 QPS of /search-suggestions.

Break SLIs out by user-facing journey when the expectations differ.
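A sketch of what the breakout buys you, with invented counts: the blended availability looks healthy while the low-traffic /search journey is failing half its requests:

```python
# Sketch: per-route availability vs. one blended service-wide number.
# The request counts are illustrative.
from collections import defaultdict

requests = (
    [("/search-suggestions", True)] * 9_990
    + [("/search-suggestions", False)] * 10
    + [("/search", True)] * 5
    + [("/search", False)] * 5
)

by_route = defaultdict(lambda: [0, 0])   # route -> [good, total]
for route, ok in requests:
    by_route[route][0] += ok
    by_route[route][1] += 1

blended = sum(g for g, _ in by_route.values()) / len(requests)
print(f"blended: {blended:.4f}")          # 0.9985 -- looks fine
for route, (g, t) in sorted(by_route.items()):
    print(f"{route}: {g}/{t} = {g/t:.3f}")  # /search is at 0.500
```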

PRO TIP

The test for a good SLI: would an engineer wake up for it at 2am? If the SLI dropping below target is not alarming enough to page someone, it is probably the wrong SLI.


Worked example — SLIs for a REST API

Suppose you run a checkout API. Here are the SLIs a mature team would define:

# 1. Availability — fraction of POST /checkout requests that succeed
sli.availability =
  sum(rate(http_requests_total{route="/checkout", method="POST", status!~"5.."}[5m]))
    /
  sum(rate(http_requests_total{route="/checkout", method="POST"}[5m]))

# 2. Latency — fraction completed in under 500ms
sli.latency =
  sum(rate(http_request_duration_seconds_bucket{route="/checkout", method="POST", le="0.5"}[5m]))
    /
  sum(rate(http_request_duration_seconds_count{route="/checkout", method="POST"}[5m]))

# 3. Correctness — fraction of checkouts where the resulting order matches the cart
#    (validated async by a reconciliation job)
sli.correctness =
  sum(rate(checkout_reconciled_total[1h]))
    /
  sum(rate(checkout_completed_total[1h]))

Three SLIs. Each is a ratio. Each is user-facing. Each answers a specific question.


From SLI to SLO — the next step

An SLI is just a number. An SLO is the target for that number. Typical transitions:

SLI: availability = successful_requests / total_requests

SLO: availability >= 99.9% over the last 28 days

The SLO is what makes the SLI operationally useful. We will get into SLOs in the next lesson — but everything downstream of SLOs (error budgets, alerts, burn rates) depends on having chosen the right SLI first.
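As a preview of that arithmetic, a 99.9% target mechanically implies an error budget. A quick calculation (the 28-day request volume is hypothetical):

```python
# Sketch: the error budget implied by a 99.9% / 28-day SLO.
slo = 0.999
window_minutes = 28 * 24 * 60

budget_minutes = (1 - slo) * window_minutes
print(f"downtime budget: {budget_minutes:.1f} min / 28 days")  # 40.3 min

requests = 10_000_000                       # hypothetical 28-day volume
budget_requests = round((1 - slo) * requests)
print(f"failed-request budget: {budget_requests}")             # 10000
```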

WARNING

Do not pick SLOs before you pick SLIs. Teams that start with "we want 99.9%" and work backwards into a metric end up measuring the wrong thing. Define the SLI first, collect real data for a month, then pick the SLO target based on actual performance plus a stretch.



What to take away

  • An SLI is always a ratio: good events / valid events over a time window.
  • Pick SLIs from four categories: availability, latency, correctness, freshness. Use what matters for your product.
  • Measure from close to the user. Exclude synthetic traffic and health checks.
  • Latency SLIs are ratios of requests-under-threshold, not percentiles.
  • A good SLI is user-centric, a ratio, and measurable. If it doesn't meet all three, it is a metric, not an SLI.
  • Define SLIs before SLOs. Collect a month of data before choosing targets.

Next lesson: turning SLIs into SLOs — picking targets, time windows, and what it means to miss.