Observability Fundamentals for Engineers

Error Budgets and Decision Making

An SLO is a promise. An error budget is the permission that promise grants you to be imperfect.

If your SLO is 99.9%, your error budget is 0.1% — the fraction of requests that are allowed to fail. Use that budget wisely and you get to ship features at full speed. Burn through it and you stop shipping features and fix reliability.

This is the mechanism that turns observability from a monitoring exercise into an engineering-culture tool. Error budgets replace gut-feel ("is it time to work on stability?") with data ("we have used 80% of our budget; time to slow down").

KEY CONCEPT

The error budget turns reliability into a currency. You can spend it on deploys, experiments, or chaos testing. But once it is gone, it is gone — no more risky changes until the next window.


Calculating the error budget

Given:

  • SLO: availability >= 99.9% over 28 days
  • Traffic: 1,000,000 requests per day

Error budget math:

Budget rate   = 100% - 99.9% = 0.1%
Budget count  = 0.1% × 28 days × 1,000,000 requests/day
              = 28,000 failed requests allowed over 28 days

That is your budget. Every failed request in the 28-day window draws from it.

If you have used 5,000 failed requests with 10 days still remaining in the window, you have 23,000 budget left and are spending well below the natural rate — at 1× burn you would have spent 18,000 by now. You are fine.

If you have used 25,000 failed requests with 10 days still remaining, you are about to run out. Stop taking risky actions.
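The budget math above can be sketched in a few lines. The SLO, window, and traffic figures are the ones from the example; `remaining` is an illustrative helper, not a standard API:

```python
# Error-budget arithmetic: SLO 99.9%, 28-day window, 1M requests/day.

SLO = 0.999
WINDOW_DAYS = 28
REQUESTS_PER_DAY = 1_000_000

budget_rate = 1 - SLO                                         # 0.1% of requests may fail
budget_count = budget_rate * WINDOW_DAYS * REQUESTS_PER_DAY   # 28,000 over the window

def remaining(failed_so_far: float) -> float:
    """Failed requests still allowed in the current window."""
    return budget_count - failed_so_far

print(round(budget_count))       # 28000 allowed
print(round(remaining(5_000)))   # 23000 left -> fine
print(round(remaining(25_000)))  # 3000 left -> stop risky changes
```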


Time-based error budgets

If you cannot count requests — e.g. a system whose SLO is "available 99.9% of time," not "99.9% of requests" — you can use time:

Budget time = (100% - 99.9%) × 28 days
            = 0.001 × 28 × 24 × 60 minutes
            = 40.3 minutes of downtime allowed per 28 days

Request-based is almost always better because it scales with traffic — a 1-minute outage during peak hours hurts more than during idle hours, and a request-based budget reflects that.
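For the time-based variant, the conversion from SLO to allowed downtime is a one-liner — a sketch, with the 99.9%/28-day figures from above:

```python
# Downtime minutes allowed for a time-based SLO over a given window.
def downtime_minutes(slo: float, window_days: int) -> float:
    return (1 - slo) * window_days * 24 * 60

print(round(downtime_minutes(0.999, 28), 1))   # 40.3 minutes per 28 days
print(round(downtime_minutes(0.9999, 28), 1))  # one more nine -> only 4.0 minutes
```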


Burn rate — how fast are you spending it?

The burn rate is how fast you are consuming the error budget relative to the budget's natural spend rate.

natural burn rate = 1x  (spending the budget exactly evenly — you will end the window at 0 budget)
2x burn           = spending twice as fast (budget gone in half the window)
10x burn          = burning 10x — budget gone in 2.8 days
100x burn         = burning 100x — budget gone in 6.7 hours

A 1× burn rate means you are tracking exactly to the SLO. A 10× burn rate means if nothing changes, you will exhaust your 28-day budget in 2.8 days.

Burn rate is the key concept for alerting. It converts "error rate" from a raw number into "is this bad given our SLO?"
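As a sketch, burn rate for a request-based SLI is just the observed error rate divided by the error rate the SLO allows, and time-to-exhaustion follows directly:

```python
# Burn rate = observed error rate / allowed error rate (1 - SLO).
def burn_rate(error_rate: float, slo: float) -> float:
    return error_rate / (1 - slo)

# At a constant burn rate, the window's budget lasts window/rate days.
def days_to_exhaustion(rate: float, window_days: int = 28) -> float:
    return window_days / rate

br = burn_rate(0.014, 0.999)       # 1.4% errors against a 0.1% budget
print(round(br))                    # 14x burn
print(round(days_to_exhaustion(br), 1))  # budget gone in 2.0 days
```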

How fast different burn rates exhaust a 28-day budget:

Burn rate   Time to exhaust   Response
1x          28 days           on track
2x          14 days           noticeable, investigate
6x          ~4.7 days         ticket or page during business hours
14x         ~2 days           page on-call, long-window alert
336x        ~2 hours          page on-call, short-window alert

Alert on multiple burn-rate thresholds, each with a different window length (e.g. 14x over 1h + 1x over 6h).

Burn rate alerts — the right way to alert on SLOs

Traditional alerts — "error rate > 5%" — are noisy. They fire on brief blips that do not threaten the SLO, and they go silent on slow steady drains that do.

Burn rate alerts solve both:

ALERT: error budget is being consumed at 14x the sustainable rate

This fires when it matters and stays quiet when it doesn't.

Multi-window, multi-burn alerts

The state of the art is Google's "multi-window, multi-burn-rate" alerting, which combines two burn-rate windows into one alert. Fire only if both are true — the long window establishes that the burn is sustained, the short window confirms it is still happening right now.

A standard production setup:

Severity           Alert fires when
Page (fast burn)   Burn rate >= 14× over 1h AND burn rate >= 14× over 5m
Page (slow burn)   Burn rate >= 6× over 6h AND burn rate >= 6× over 30m
Ticket             Burn rate >= 3× over 24h AND burn rate >= 3× over 2h

The fast-burn alert catches outages (high burn, sharp). The slow-burn alert catches sustained degradations. The ticket catches long-tail issues that are not urgent but need work.

The long-window condition is what prevents the alert from firing on a 30-second blip. The short-window condition is what stops the alert promptly once the outage is resolved — without it, the alert would keep firing until the long window's average recovered.
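A minimal sketch of the two-window check — `fast_burn_page` and the `err_*` inputs are illustrative names (in practice the error rates come from your metrics backend), and the thresholds mirror the table above:

```python
SLO = 0.999

def burn(error_rate: float) -> float:
    # Burn rate relative to the error rate the SLO allows.
    return error_rate / (1 - SLO)

def fast_burn_page(err_1h: float, err_5m: float, threshold: float = 14.0) -> bool:
    # Fire only when BOTH windows exceed the threshold:
    # the 1h window proves the burn is sustained, the 5m window
    # proves it is still happening.
    return burn(err_1h) >= threshold and burn(err_5m) >= threshold

print(fast_burn_page(0.02, 0.03))    # True: 20x and 30x -> page
print(fast_burn_page(0.02, 0.0005))  # False: 1h still elevated, 5m recovered
```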

KEY CONCEPT

Never alert on raw error rate. Burn-rate alerts are fewer, more actionable, and tied to the SLO. They are the single biggest alert-quality win a team can adopt.


Burn rate alerts in PromQL

# Fast burn, 1h window AND 5m window
(
  (1 - availability_sli_1h) / (1 - 0.999) > 14
  AND
  (1 - availability_sli_5m) / (1 - 0.999) > 14
)

Where availability_sli_1h and availability_sli_5m are recording rules for the SLI over those windows. Most teams build reusable macros or use a tool (Pyrra, Sloth, OpenSLO) to generate these rules from SLO definitions.


The error budget policy

An error budget without a policy for what happens when it runs out is just a number on a dashboard. The policy is the organizational contract that gives the budget teeth.

A typical policy:

If the 28-day error budget is > 30% remaining:
  - Full feature development pace
  - Experiments, canaries, chaos testing permitted

If the budget is 10-30% remaining:
  - Continue feature work
  - Hold risky changes (large migrations, infra upgrades)
  - Schedule postmortems on recent incidents

If the budget is 0-10% remaining:
  - Freeze feature launches
  - Redirect 25% of team capacity to reliability work
  - Daily SLO review

If the budget is exhausted (< 0):
  - Feature freeze until budget recovers
  - All hands on reliability work
  - Reviewed at engineering leadership level

The specific thresholds and responses vary by team. The shape is universal: more budget = more freedom; less budget = more discipline.
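The tiers above are easy to encode. This sketch mirrors this document's example thresholds — the tier names and boundaries are not a standard, just the policy shown:

```python
# Map remaining budget (as a fraction of the window's total) to a policy tier.
def policy_tier(budget_remaining_fraction: float) -> str:
    if budget_remaining_fraction > 0.30:
        return "full speed"
    if budget_remaining_fraction > 0.10:
        return "hold risky changes"
    if budget_remaining_fraction > 0.0:
        return "freeze launches, 25% to reliability"
    return "feature freeze, all hands"

print(policy_tier(0.55))   # full speed
print(policy_tier(0.08))   # freeze launches, 25% to reliability
```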

WARNING

A budget policy is only valuable if the organization will enforce it. If PMs can override "feature freeze" for any launch, the policy is theatre. Leadership has to actually say "no new features until the budget recovers" at least once, or nobody believes it.


Consequences of running out

When the budget runs out, the "cost" takes forms like:

  1. Feature freeze. No new features until the next window starts.
  2. Reliability work prioritized. All the postmortem TODOs finally get done.
  3. Deploys become more cautious. Smaller changes, more canary time.
  4. Chaos experiments paused. You cannot afford the disruption.
  5. Cross-team attention. Upstream services that caused the burn get engineering focus.

The goal is not to punish. The goal is to make the default response to an SLO miss automatic — you do not need to negotiate whether to invest in reliability, because the policy already decided.


Error budget as a negotiation tool

The most underrated use of error budgets: negotiating with product management.

Without a budget:

  • PM: "we need to ship feature X next week"
  • SRE: "no, it's not stable enough"
  • PM: "we have to ship it, the CEO promised the board"
  • (ships anyway, breaks things)

With a budget:

  • PM: "we need to ship feature X next week"
  • SRE: "we are at 8% remaining budget. The policy says we only ship non-risky changes. Is this risky?"
  • PM: "it's a big change, yes"
  • SRE: "then we either defer, or decide together to burn budget and accept the risk, with leadership approval. Here is the cost."

The budget makes the tradeoff explicit. It is not "engineering wants more time." It is "we have this much reliability currency. Spending it here means we cannot spend it on other things."

PRO TIP

A healthy team occasionally runs the budget down. That means you are making full use of the budget to ship. A team that always has 90% of budget left is being too cautious — they could be shipping faster.


The recovery pattern

When you blow the budget, here is the recovery pattern that works:

  1. Postmortem within 5 days. Blameless, specific, action-oriented.
  2. Identify 1-3 TODOs that, if they had been done, would have prevented the incident.
  3. Do those TODOs before anything else. Put them ahead of whatever you were working on.
  4. Chaos-test the fix. Simulate the failure mode; verify the fix holds.
  5. Move on. Do not demand "no incidents" forever. Incidents happen; the question is how quickly and effectively you recover.

Error budget policy — what NOT to do

Anti-pattern 1: move the SLO when you blow the budget

The temptation: "our SLO is 99.95% but we keep missing it. Let us change it to 99.9%." This is almost always the wrong answer. The SLO reflected a reality about what users need. Lowering it because you cannot hit it is lying to yourself. Either invest in reliability or honestly renegotiate with users about what you will deliver.

Anti-pattern 2: make the budget infinite in practice

"We blew the budget but there is no freeze because we need to ship." If the policy has no teeth, the budget has no value. Either enforce the policy or get rid of it.

Anti-pattern 3: blame the on-call

The budget is a team property, not an individual's responsibility. If someone made a bad deploy that burned the budget, the team's systems failed too — why did the canary miss it? Why did the alert not fire earlier? Budget loss is a process failure, not a person.


A fully worked example

Setup:

  • SLO: availability >= 99.95% over 28 days
  • Traffic: 5,000 RPS steady
  • 28-day window: 5,000 × 86,400 × 28 = 12.1B requests
  • Budget: 0.05% × 12.1B = 6.05M failed requests allowed in 28 days
  • Budget per day (1× burn): ~216,000 failed requests
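The setup numbers can be re-derived in a few lines (assuming, as stated, a steady 5,000 RPS):

```python
RPS = 5_000
WINDOW_DAYS = 28
SLO = 0.9995

total_requests = RPS * 86_400 * WINDOW_DAYS   # ~12.1B requests in the window
budget = (1 - SLO) * total_requests           # ~6.05M failures allowed
budget_per_day = budget / WINDOW_DAYS         # ~216k/day at 1x burn

print(total_requests, round(budget), round(budget_per_day))
```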

Week 1 of the window:

  • Day 1-3: normal. ~50k failed requests. Budget spent: ~150k. Budget remaining: 5.9M. All good.
  • Day 4: a bad deploy. Error rate spikes to 2% for about 100 minutes. Failed requests that day: ~600k.
  • Total budget spent: ~750k. Budget remaining: 5.3M. Still plenty.
  • Day 5-6: recovered, normal traffic. ~100k failed. Budget spent: 850k. Remaining: 5.2M.
  • Day 7: secondary incident — the same bad deploy is still causing issues on the cache layer. 1.5% error rate for 2 hours. ~540k failed.
  • Budget spent: ~1.39M. Remaining: ~4.66M. About 23% of the 28-day budget used in 7 days.

Burn rate over the 7-day window: ~23% spent in 7/28 = 25% of the window → a burn rate of ~0.92×. Under 1× — we are on track to finish the 28-day window with budget left.

But the two incident days (day 4 and day 7) together burned (600k + 540k) / (2 days × 216k/day) ≈ 2.6× — a ticket-level burn rate.

Decision: no feature freeze. Postmortem the bad deploy. Tighten canary thresholds. Stop the bleed.
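The week-1 burn check can be re-run in a few lines; the daily failure counts are the walkthrough's approximations, so expect small rounding differences from the prose:

```python
BUDGET = 6_048_000       # failures allowed in the 28-day window
PER_DAY = BUDGET / 28    # 1x spend: ~216k/day

spent = 150_000 + 600_000 + 100_000 + 540_000   # days 1-7 combined

# Fraction of budget spent vs fraction of window elapsed -> window burn rate.
window_burn = (spent / BUDGET) / (7 / 28)             # ~0.92x: under 1x, on track
# The two incident days alone, against two days of 1x spend.
incident_burn = (600_000 + 540_000) / (2 * PER_DAY)   # ~2.6x: ticket-level

print(round(window_burn, 2), round(incident_burn, 2))
```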


Quiz

KNOWLEDGE CHECK

Your SLO is 99.9% availability over 28 days. Over the last hour, error rate has been 5%. What is the right action?


What to take away

  • Error budget = 1 - SLO target, expressed as failed requests allowed over the window.
  • Burn rate = how fast you are consuming it, relative to 1× (natural spend).
  • Alert on burn rate, not raw error rate. Use multi-window, multi-burn-rate patterns: e.g. page at 14× over 1h AND 5m; ticket at 3× over 24h.
  • Have a policy for what happens at different budget levels. Enforce it.
  • The budget makes reliability vs velocity a visible tradeoff. It is a negotiation tool, not a scoreboard.
  • Occasionally running the budget low is healthy. Always being at 90% remaining means you could ship faster.

Next module: the practical tools — dashboards, alerts, and the metrics-logs-traces workflow in action.