Observability Fundamentals for Engineers

SLO: What You Commit To

An SLI is a measurement. An SLO, or Service Level Objective, is a promise: the fraction of time this measurement must be above a target.

SLOs are where observability stops being a technical exercise and becomes an engineering-culture decision. A team with well-chosen SLOs has a structured way to decide between "ship the feature" and "stop and fix reliability." A team without SLOs makes that decision by gut feel, which almost always means fixing reliability too late and too little.

KEY CONCEPT

The SLO is a decision-making tool, not a pass/fail grade. It tells you when to stop shipping and fix things. If you never miss your SLO, it is too loose. If you always miss it, it is too tight. The right SLO is slightly harder than your current state.

The shape of an SLO

An SLO has three parts:

<SLI>  >=  <target percentage>  over  <time window>

For example:

availability SLI >= 99.9% over the last 28 days

Translation: "In any trailing 28-day window, at least 99.9% of requests must have been successful." That is one number (99.9), one time window (28 days), and one SLI it applies to (availability).
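The three parts map directly onto a small data structure. The sketch below is illustrative only: the `SLO` class, its field names, and the choice to treat a window with no traffic as compliant are assumptions, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """Hypothetical container for the three parts of an SLO."""
    sli_name: str      # which measurement the objective applies to
    target_pct: float  # e.g. 99.9
    window_days: int   # e.g. 28

    def is_met(self, good_events: int, total_events: int) -> bool:
        """True if the SLI over the window is at or above the target."""
        if total_events == 0:
            return True  # assumption: no traffic counts as compliant
        return 100.0 * good_events / total_events >= self.target_pct


availability_slo = SLO("availability", 99.9, 28)

# 999,200 good requests out of 1,000,000 is 99.92%: within target.
print(availability_slo.is_met(good_events=999_200, total_events=1_000_000))
# 998,500 good requests out of 1,000,000 is 99.85%: a miss.
print(availability_slo.is_met(good_events=998_500, total_events=1_000_000))
```

The event counts are expected to cover exactly the trailing window named in the SLO.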

The "nines": what they actually mean

Availability targets are usually written in "nines." Here is what they buy you in real downtime:
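The downtime implied by a target is simple arithmetic: the window length multiplied by the fraction you are allowed to fail. The helper below is a minimal sketch (the function name is made up) and assumes the whole allowance is spent as full outages rather than as scattered failed requests.

```python
def allowed_downtime_minutes(target_pct: float, window_days: int = 28) -> float:
    """Minutes of total outage a target tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - target_pct / 100)


for target in (99.0, 99.5, 99.9, 99.95, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} minutes per 28 days")
```

A 28-day window is 40,320 minutes, so 99.9% leaves roughly 40 minutes and each extra nine divides the allowance by ten.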

Most user-facing internet services target 99.9% to 99.95%. Core infrastructure (payments, auth) targets 99.95% to 99.99%. Internal / dev tools often target 99% to 99.5%. Five nines is banking / telecom territory and is not realistic for most software.

Why 100% is always wrong

Engineers who have not set SLOs before tend to reason: "we should target 100%, obviously." This is wrong for three reasons:

1. The cost curve is exponential

As a rough rule of thumb, going from 99% to 99.9% might cost 2× in infrastructure and engineering time. Going from 99.9% to 99.99% might cost 10×, and going from 99.99% to 99.999% another 10×. The last few nines are where reliability engineering consumes your entire budget.

2. Users cannot tell the difference

The chain of things between your service and a user (DNS, ISP, device, browser, WiFi) averages around 99.8% reliability end-to-end. Engineering effort spent pushing your service from 99.99% to 99.999% cannot be perceived by a user: the noise floor of the rest of the internet is higher than the improvement you delivered.
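A quick calculation makes the noise floor concrete. This sketch assumes the path and the service fail independently and in series, so their availabilities multiply. The 99.8% figure is the one quoted above, and `end_to_end` is a made-up helper name.

```python
import math


def end_to_end(*availabilities: float) -> float:
    """Availability of components in series, assuming independent failures."""
    return math.prod(availabilities)


path_to_user = 0.998  # DNS, ISP, device, browser, WiFi (figure from the text)

for service in (0.9999, 0.99999):
    seen_by_user = end_to_end(path_to_user, service)
    print(f"service at {service:.3%} -> user experiences {seen_by_user:.4%}")

gain = end_to_end(path_to_user, 0.99999) - end_to_end(path_to_user, 0.9999)
print(f"gain from the extra nine: {gain:.4%}")
```

Under these assumptions the extra nine moves what the user experiences by less than one hundredth of a percentage point.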

3. You need the budget for change

Every deploy, every config change, every experiment has some risk of causing failures. If you target 100%, every mistake burns budget you do not have. Teams that target 100% end up either terrified of change or dishonest about their uptime.

KEY CONCEPT

SLOs are about picking a level of reliability that is good enough for users, and using the remainder as a budget for change. That remainder is the "error budget," covered in the next lesson.

The time window

The time window is how far back you look to compute the SLO. Common choices:

28 days (rolling): the most common. Close to a month, always whole weeks, easy to reason about.
30 days (calendar): easier to explain to non-engineers, slightly harder to query cleanly.
7 days (rolling): for faster-moving services, more reactive.
90 days: for anything contractual (customer-facing SLAs).

The longer the window, the more stable the number. The shorter the window, the faster you react to recent changes. Most teams use a 28-day window for the canonical SLO and a 7-day or 1-day burn rate window for alerting (covered in the next lesson).
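The stability trade-off is easy to see with a trailing-window calculation. The sketch below assumes you already have one `(good, total)` request count per day. The function name and the sample data are invented for illustration.

```python
from collections import deque


def rolling_sli(daily_counts, window_days=28):
    """Yield the fraction of good requests over the trailing window, day by day.

    daily_counts: iterable of (good, total) tuples, one per day.
    """
    window = deque(maxlen=window_days)  # old days fall off automatically
    for good, total in daily_counts:
        window.append((good, total))
        good_sum = sum(g for g, _ in window)
        total_sum = sum(t for _, t in window)
        yield good_sum / total_sum if total_sum else 1.0


# Made-up data: 27 clean days, then one day where 10% of requests fail.
days = [(1000, 1000)] * 27 + [(900, 1000)]

sli_28d = list(rolling_sli(days, window_days=28))[-1]
sli_7d = list(rolling_sli(days, window_days=7))[-1]
print(f"28-day window: {sli_28d:.3%}")
print(f" 7-day window: {sli_7d:.3%}")
```

The same bad day dents the 7-day number far more than the 28-day number, which is why the short window suits alerting and the long one suits the canonical SLO.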

PRO TIP

Do not mix windows across dashboards. If your SLO is 28-day, every panel that references the SLO should use 28 days. Mixing 7-day and 30-day views of "the same SLO" causes endless confusion.

Choosing the SLO target

The right way to pick an SLO target:

Measure for a month. Collect SLI data for 4-6 weeks. Learn what your current state is.
Start slightly above current state. If you are at 99.5% availability today, start the SLO at 99.7%. That is a target you can probably hit, but not trivially.
Tighten over time. Every quarter, review. If you consistently hit the SLO, tighten it. If you consistently miss it, either invest in reliability or loosen it deliberately.
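The quarterly review step can be written down as a simple rule. This is only a sketch: the function name is hypothetical, and it assumes "consistently" means every month in the quarter, which is one possible reading among several.

```python
def review_slo(target_pct, monthly_achieved_pct):
    """Suggest a review outcome from the quarter's monthly SLI values."""
    if not monthly_achieved_pct:
        return "hold"  # nothing measured yet, nothing to decide
    if all(m >= target_pct for m in monthly_achieved_pct):
        return "tighten"
    if all(m < target_pct for m in monthly_achieved_pct):
        return "invest or loosen"
    return "hold"


print(review_slo(99.7, [99.8, 99.75, 99.9]))   # hit every month
print(review_slo(99.7, [99.5, 99.6, 99.65]))   # missed every month
print(review_slo(99.7, [99.8, 99.6, 99.9]))    # mixed results
```

The output is a prompt for the review conversation, not a replacement for it.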

The SLO is not a promise for the next 10 years. It is a target for this quarter.

The three forms of an SLO

Different kinds of SLIs lead to different SLO phrasings:

Availability / correctness

99.9% of requests succeed over 28 days

Latency

95% of requests complete in under 500ms over 28 days

Freshness

99% of records are ingested within 5 minutes of source over 28 days

Notice that the "target" number is lower for latency (95%) than for availability (99.9%). That is normal: getting 99.9% of requests under 500ms is much harder than getting 99.9% of them to succeed.
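Latency and freshness SLIs share one shape: the fraction of observations that fall under a threshold. The helper and the sample values below are made up for illustration, and the strict "less than" comparison is an assumption about how you want to treat values that land exactly on the threshold.

```python
def fraction_within(values, threshold):
    """Fraction of observations strictly below the threshold."""
    values = list(values)
    if not values:
        return 1.0  # assumption: no data counts as compliant
    return sum(v < threshold for v in values) / len(values)


latencies_ms = [120, 180, 250, 300, 310, 420, 480, 495, 510, 900]  # made-up sample
ingest_delays_s = [30, 45, 60, 280, 600]                           # made-up sample

latency_sli = fraction_within(latencies_ms, 500)          # 8 of 10 under 500ms
freshness_sli = fraction_within(ingest_delays_s, 5 * 60)  # 4 of 5 within 5 minutes

print(f"latency SLI:   {latency_sli:.0%} (target 95%)")
print(f"freshness SLI: {freshness_sli:.0%} (target 99%)")
```

Both samples would miss their targets. In practice the inputs would be every request or record in the 28-day window, not a handful of values.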

The multi-SLO service

Most real services have more than one SLO. A checkout API might have:

SLO 1: availability       >= 99.95% over 28 days
SLO 2: p95 latency        < 500ms,   99% compliance over 28 days
SLO 3: p99 latency        < 2s,      99% compliance over 28 days
SLO 4: checkout correctness >= 99.99% over 28 days

The SLO for the service is the conjunction: all four have to be within target. Missing any of them is a miss.
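The conjunction is a one-line check once every SLO is expressed as an achieved percentage against a target percentage. The measured values below are invented, and `missed_slos` is a hypothetical helper.

```python
def missed_slos(results):
    """Return the names of SLOs whose achieved value is below target.

    results: mapping of SLO name -> (achieved_pct, target_pct).
    """
    return [name for name, (achieved, target) in results.items() if achieved < target]


checkout_api = {
    "availability":          (99.97, 99.95),
    "p95 latency < 500ms":   (99.4, 99.0),
    "p99 latency < 2s":      (98.7, 99.0),   # made-up miss
    "checkout correctness":  (99.995, 99.99),
}

missed = missed_slos(checkout_api)
service_within_slo = not missed  # all four must hold
print(service_within_slo, missed)
```

Returning the list of misses rather than a bare boolean keeps the answer to "which one?" next to the answer to "are we within SLO?".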

WARNING

Do not create SLOs for metrics that do not affect users. Each SLO has a real ongoing maintenance cost (alerts, dashboards, reviews, postmortems). Five SLOs is a lot. Ten is too many. Each one must be worth the attention cost.

Journey-based SLOs

For user-facing products, the most useful SLOs are often defined per user journey: a sequence of endpoints that together make up a user-visible experience.

For example, for an e-commerce site:

Browse products (GET /products, GET /products/:id): availability >= 99.95%, latency < 300ms
Search (GET /search): availability >= 99.9%, latency < 600ms (search can be slower)
Checkout (POST /cart, POST /checkout, POST /payments): availability >= 99.99% (this one matters most), latency < 1s

Journey SLOs map to actual user experience. "API is up" does not: if the checkout path is down and everything else works, the user still cannot buy anything.
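One way to measure a journey is to count an attempt as good only when every step in it succeeded. The sketch below assumes you can group requests into attempts (for example by session), which the text does not specify. The data shape and function name are invented.

```python
CHECKOUT_STEPS = ("POST /cart", "POST /checkout", "POST /payments")


def journey_availability(attempts, steps):
    """Fraction of attempts in which every step of the journey succeeded.

    attempts: list of dicts mapping endpoint -> True/False for that attempt.
    A step missing from the dict is treated as not completed.
    """
    if not attempts:
        return 1.0  # assumption: no attempts counts as compliant
    succeeded = sum(all(a.get(step, False) for step in steps) for a in attempts)
    return succeeded / len(attempts)


attempts = [
    {"POST /cart": True, "POST /checkout": True, "POST /payments": True},
    {"POST /cart": True, "POST /checkout": True, "POST /payments": False},
    {"POST /cart": True, "POST /checkout": True, "POST /payments": True},
    {"POST /cart": True, "POST /checkout": False},  # never reached payments
]

print(journey_availability(attempts, CHECKOUT_STEPS))  # 2 of 4 attempts completed
```

In this made-up sample every attempt got through POST /cart, yet only half of the journeys completed, which is the gap a per-endpoint view hides.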

SLOs vs SLAs

Two closely related but distinct terms:

SLO (Service Level Objective): internal target. What the team commits to. Miss an SLO: you have an internal problem to solve.
SLA (Service Level Agreement): contractual promise to customers. Miss an SLA: you owe the customer credit or a refund.

SLAs are always looser than SLOs. A team targets an SLO of 99.95% internally and offers an SLA of 99.9% to customers. The gap is the buffer: if you miss the SLO, you still have room before the SLA is broken.
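The buffer shows up as a middle state between "fine" and "breach". The function below is a hypothetical sketch using the 99.95% and 99.9% figures from the paragraph above as defaults.

```python
def classify(achieved_pct, slo_pct=99.95, sla_pct=99.9):
    """Place a measured value relative to the internal SLO and the customer SLA."""
    if sla_pct >= slo_pct:
        raise ValueError("the SLA must be looser than the SLO")
    if achieved_pct >= slo_pct:
        return "within SLO"
    if achieved_pct >= sla_pct:
        return "SLO missed, SLA intact"  # internal problem, no credits owed
    return "SLA breached"                # contractual problem


for achieved in (99.97, 99.92, 99.8):
    print(f"{achieved}% -> {classify(achieved)}")
```

Setting the two thresholds equal removes the middle state entirely, which is why the guard rejects it.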

KEY CONCEPT

Never set your SLA equal to your SLO. The moment you have a bad week, you go straight from "internal concern" to "legal concern." Build in a gap.

The SLO review cadence

An SLO is not set once and forgotten. The operating rhythm:

Weekly: glance at the SLO dashboard. Any on-fire SLOs?
Monthly: 15-minute SLO review per service. Trends. Misses. Near-misses.
Quarterly: deep review. Should any SLOs tighten? Should any loosen? Do the SLIs still reflect user experience?
Yearly: are these still the right SLOs at all? Has the product changed enough to warrant new ones?

Teams that skip these reviews discover a year later that they have been measuring the wrong things, or that their SLO is trivial to hit because the product changed.

A worked example: picking SLOs for a new service

You are launching a new internal billing API. Current state (measured for 4 weeks):

Availability: 99.93%
p95 latency: 380ms (we want under 500ms)
Share of time p95 latency is under 500ms: 94%

Reasonable starting SLOs:

SLO 1: availability >= 99.9% over 28 days
    Rationale: current state is 99.93%. Target of 99.9% is achievable.

SLO 2: p95 latency < 500ms, 95% compliance over 28 days
    Rationale: current compliance is 94%. Target of 95% is achievable but tight.

Review in Q+1. If SLO 1 is consistently at 99.97%, tighten to 99.95%.
If SLO 2 is consistently at 99%, tighten to 98%.

Deliberately conservative. Deliberately close to current state: SLO 1 sits just below it, SLO 2 just above. Deliberately time-boxed.

Quiz

KNOWLEDGE CHECK

You are setting SLOs for an internal developer tool used by ~50 engineers. Current availability is around 99.5%. What is the right SLO target to set initially?

What to take away

An SLO is <SLI> >= <target%> over <time window>.
100% is always the wrong target. The last nine costs more than all the previous nines combined and users cannot tell the difference.
Typical SLO targets: 99% (internal tools), 99.9% (most services), 99.95% (payments / auth), 99.99% (rare; critical infra).
Use a 28-day rolling window by default.
SLAs are customer-facing and always looser than SLOs. Never set them equal.
Pick an SLO slightly harder than current state. Review quarterly. Tighten when you consistently hit it.
Multi-SLO services are normal; keep the number small. Five SLOs is a lot.

Next lesson: error budgets, turning the gap between your SLO and 100% into a concrete engineering-decision tool.
