Alert Design
Good alerting is the single biggest quality-of-life lever in an on-call rotation. Bad alerting — noisy, wrong-priority, poorly targeted — is why engineers burn out and leave teams.
The worst alerting setups share the same pathology: hundreds of alerts, most of them firing regularly, most of them irrelevant. Engineers learn to ignore alerts. Then the one alert that matters gets ignored too.
This lesson is about how to design alerts that actually page you when things matter and stay quiet when they don't.
Every page should be actionable, urgent, and real. If any of those three is missing, the alert is either misconfigured or should not exist as a page. The right page count per week per service is small — single digits.
The only good reason to page someone
An alert should page a human being if and only if:
- Something is happening now that affects users.
- A human is required to fix it (not a retry, not autoscaling, not a self-healing mechanism).
- It is urgent — waiting until morning would materially worsen the outcome.
If any of these is missing, you have either a ticket (urgent but not page-worthy) or a noise problem.
The page-worthy failure modes
- Service is down or severely degraded (SLO burning fast).
- Data is being lost or corrupted.
- A security incident is in progress.
- A dependency has failed in a way the service cannot auto-recover from.
That is approximately the whole list.
Alert on symptoms, not causes
The most important rule in alerting:
Alert on what the user sees, not on what is happening inside your system. User-visible errors. User-visible latency. Data that should be current but isn't. Alerts on internal causes (CPU, memory, queue depth) generate noise because they fire even when the system is handling the issue.
The failure mode of cause-based alerts
- CPU hits 95%. You page the on-call.
- On-call wakes up. Looks at the dashboard. Service is serving traffic fine. Latency is normal. Error rate is zero.
- On-call goes back to sleep, annoyed.
- The alert is tuned upward (threshold raised to 98%, or a 5-minute window added). Then it either fires again anyway, or stays quiet the one time it mattered.
CPU being high is a cause. Sometimes it leads to a user-visible problem and sometimes it does not. Paging on it means you page whenever the cause could, theoretically, be relevant.
Symptom-based alerts
- Error rate for the service > 1% for 5+ minutes → user requests are failing → page.
- p99 latency > 1s for 10+ minutes → users are seeing slow responses → page.
- Pipeline lag > 30 minutes → freshness SLO violated → page.
These fire when users are actually affected. CPU can be at 95% all day and not trigger. High queue depth can happen and not trigger — as long as users are getting served.
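As a sketch, the first of those symptom alerts might look like this as a Prometheus rule. The metric and label names (`http_requests_total`, `service`, `status`) are illustrative assumptions, not from this lesson:

```yaml
# Hypothetical symptom alert: page when >1% of user requests fail for 5+ minutes.
# Metric and label names are illustrative; adapt to your instrumentation.
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
      /
    sum(rate(http_requests_total[5m])) by (service)
    > 0.01
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "Error rate above 1% for {{ $labels.service }}"
```

Note that nothing here mentions CPU, memory, or queues: the rule fires only when users are actually failing.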
The taxonomy: page, ticket, nothing
Every potential alert falls into one of three buckets:

- PAGE: wake a human now. User-visible, urgent, and requires a person to fix.
- TICKET: a human must act, but it can wait for working hours (cert expiring in 14 days, error budget draining slowly, disk filling over days).
- NOTHING: a dashboard panel or a log line. Useful for debugging, not worth interrupting anyone.

The mistake most teams make is putting everything in the PAGE bucket "just in case." The fix is to deliberately categorize each alert and move most of them down a tier.
Multi-window, multi-burn-rate alerts (revisited)
From the SLO module — the standard production pattern:
```yaml
- alert: SLOFastBurn
  expr: |
    (
      (1 - availability_sli_1h) / (1 - 0.999) > 14
      AND
      (1 - availability_sli_5m) / (1 - 0.999) > 14
    )
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "Fast burn of availability SLO for {{ $labels.service }}"
    runbook: "https://runbooks.company.com/slo-burn"

- alert: SLOSlowBurn
  expr: |
    (
      (1 - availability_sli_6h) / (1 - 0.999) > 6
      AND
      (1 - availability_sli_30m) / (1 - 0.999) > 6
    )
  for: 15m
  labels:
    severity: page
  annotations:
    summary: "Sustained burn of availability SLO for {{ $labels.service }}"

- alert: SLOTicketBurn
  expr: |
    (
      (1 - availability_sli_24h) / (1 - 0.999) > 3
      AND
      (1 - availability_sli_2h) / (1 - 0.999) > 3
    )
  for: 1h
  labels:
    severity: ticket
```
Three alerts per SLO: one for acute outages, one for sustained degradation, one for long-tail drain. That is the whole alerting surface for the availability SLO. Do the same for latency. That is ~6 alerts per service, total, for the SLOs.
Tools like Pyrra, Sloth, and OpenSLO generate these rules automatically from a simple SLO definition. Do not write them by hand unless you understand exactly what they do — the math is subtle and getting it wrong produces either missed pages or endless noise.
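For a sense of what those tools consume, here is a hedged sketch of a Sloth-style SLO definition that would generate the multi-window, multi-burn-rate rules above. The schema follows Sloth's `prometheus/v1` format as I understand it (check the current Sloth docs before relying on field names), and the queries are illustrative:

```yaml
# Sketch of a Sloth SLO definition; field names per Sloth's prometheus/v1
# schema, queries illustrative.
version: "prometheus/v1"
service: "orders-service"
slos:
  - name: "availability"
    objective: 99.9
    sli:
      events:
        error_query: sum(rate(http_requests_total{job="orders-service",status=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total{job="orders-service"}[{{.window}}]))
    alerting:
      name: OrdersAvailabilityBurn
      page_alert:
        labels:
          severity: page
      ticket_alert:
        labels:
          severity: ticket
```

From roughly a dozen lines of definition, the generator emits the full set of burn-rate windows with the math already correct.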
The alert anatomy — what every alert needs
An alert that pages you should arrive with enough information to act. Minimum payload:
```
SEVERITY: page / ticket
SERVICE: orders-service
SUMMARY: Error rate is 5% (threshold 1%) for 3 minutes
DASHBOARD: <link to the overview dashboard>
RUNBOOK: <link to the specific runbook for this alert>
RECENT DEPLOYS: <automated query of deploys in last 30m>
RELATED TRACES: <link to tail-sampled slow/error traces>
```
The dashboard link, the runbook link, and "recent deploys" are the three fields that save the most time. An on-call engineer should be able to start diagnosing within 30 seconds of the page, not 3 minutes after they locate everything.
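In Prometheus terms, most of that payload travels as rule annotations. A sketch (the URLs are placeholders; `humanizePercentage` is a real Prometheus template function):

```yaml
  annotations:
    summary: "Error rate is {{ $value | humanizePercentage }} (threshold 1%)"
    dashboard: "https://grafana.internal/d/orders-overview"
    runbook: "https://runbooks.company.com/orders-high-error-rate"
    # "Recent deploys" and "related traces" are typically injected by the
    # paging tool or a notification webhook, not by Prometheus itself.
```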
The runbook matters
Every paging alert should have a runbook. The runbook does not have to be long — one page, with:
- What this alert means (in plain language).
- What the usual causes are.
- What to check first (linked dashboard panels).
- What to do (remediation steps, commands, escalations).
- When to escalate.
If there is no runbook, the alert degrades over time: people misremember what it means, and the same investigation is repeated from scratch on every page.
Avoiding alert fatigue
If you are paged more than a few times a week per service, the alerts are tuned wrong. Fatigue sets in. People stop reading pages carefully. The signal disappears.
Fixes:
1. Audit the last 90 days of pages
For each page, ask: "was this actionable? was it urgent? was it real?" If the answer to any is no, the alert is either miscategorized or should not exist.
2. Merge redundant alerts
"API error rate" and "API 5xx rate" and "API 500 rate" are three versions of the same alert. Merge them into one.
3. Suppress during known-maintenance
Planned work should not page. Your alerting tool should let you silence the relevant alerts during a maintenance window.
4. Add for: durations
A 30-second blip is often self-healing. Add for: 5m to any alert where a brief spike is normal. Do not set it so long that slow disasters slip through.
5. Group by service / cluster
If 50 alerts fire at once because one cluster is in trouble, collapse them into one notification. Alertmanager's group_by does this.
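The grouping fix maps directly onto Alertmanager's routing config. A minimal sketch, assuming a PagerDuty receiver (the receiver name and key are placeholders):

```yaml
# Alertmanager routing sketch: collapse correlated alerts into one notification.
route:
  group_by: ["service", "cluster"]
  group_wait: 30s        # wait briefly so related alerts batch into one page
  group_interval: 5m     # batch further alerts arriving in the same group
  repeat_interval: 4h    # re-notify only if still firing after 4 hours
  receiver: oncall-pager
receivers:
  - name: oncall-pager
    pagerduty_configs:
      - routing_key: "<redacted>"
```

With this in place, a cluster-wide failure produces one notification listing its members, not fifty separate pages.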
Do not set "for:" so long that it becomes cover. for: 30m on an SLO burn alert means you will not know about an outage for 30 minutes. The SLO burn rate itself already bakes in some duration; do not stack another 30 minutes on top.
Special alerts worth having
Beyond SLO burn, a few domain alerts are universally useful:
1. Absence alerts — "metric stopped reporting"
absent_over_time(up{job="orders-service"}[5m])
If the service stops scraping entirely, every other alert is silent too — because they have no data to fire on. Catch this explicitly.
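A full rule for this, following the same shape as the others (`up{job="orders-service"}` assumes the standard Prometheus `up` metric and an illustrative job name):

```yaml
# Absence alert: fires if the service has reported no 'up' samples for 5 minutes.
- alert: MetricsAbsent
  expr: absent_over_time(up{job="orders-service"}[5m])
  labels:
    severity: page
  annotations:
    summary: "No metrics received from orders-service for 5 minutes"
```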
2. Cert expiry
(probe_ssl_earliest_cert_expiry - time()) / 86400 < 14
14 days out, create a ticket. 3 days out, page.
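Expressed as two rules over the blackbox exporter's `probe_ssl_earliest_cert_expiry` metric, one per tier:

```yaml
# Ticket two weeks out, page three days out.
- alert: CertExpirySoon
  expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 14
  labels:
    severity: ticket
  annotations:
    summary: "TLS cert for {{ $labels.instance }} expires in under 14 days"

- alert: CertExpiryImminent
  expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 3
  labels:
    severity: page
  annotations:
    summary: "TLS cert for {{ $labels.instance }} expires in under 3 days"
```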
3. Replication / failover lag
For stateful systems, replication lag is worth alerting on independently of SLO burn, because it signals risk of data loss during a failover.
4. Disk / storage
Not on utilization (cause) but on projected time to full (symptom of how much time you have to act).
predict_linear(disk_free_bytes[6h], 3600*24) < 0
Translation: at the current rate of fill, disk will be full within 24 hours.
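Wrapped into a rule, with a `for:` duration so one anomalous write burst does not fire it (`disk_free_bytes` is the metric name used above; on node_exporter the equivalent is `node_filesystem_free_bytes`):

```yaml
# Projected-time-to-full alert: extrapolate the last 6h of fill rate 24h forward.
- alert: DiskFullWithin24h
  expr: predict_linear(disk_free_bytes[6h], 24 * 3600) < 0
  for: 30m
  labels:
    severity: ticket
  annotations:
    summary: "{{ $labels.instance }} projected to run out of disk within 24 hours"
```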
Anti-patterns in alerting
1. Alerting on everything
Every metric is not an alert. Most metrics are dashboards. A few are alerts.
2. "The same alert but with a higher threshold"
If an alert fires noisily, the first instinct is to raise the threshold. This makes it less noisy but also less useful. The right question is usually: "is this the right alert at all?"
3. Alerting on derivatives
"Request rate dropped 20% in the last 5 minutes" sounds like a good alert. In practice, request rates drop 20% for ten non-outage reasons — marketing campaign ended, a large customer turned off a feature, a cron job finished. Derivative alerts have terrible precision.
4. Alerts with no owner
Every alert should be owned by a service team. If an alert fires and no one knows whose problem it is, it should not exist.
5. Alerts without runbooks
See above. Undocumented alerts rot.
Mini runbook template
Every alert should have one of these:
```markdown
# Alert: SLOFastBurn — orders-service availability

## What it means
The error rate on orders-service is burning the availability SLO at 14x the normal rate.
If it continues, we will exhaust the 28-day error budget in 2 days.

## Usual causes
- Recent deploy introduced a regression (check annotations on the overview dashboard)
- Downstream dependency is failing (check orders-service -> database + recommendation-service traces)
- Config change affecting a specific path
- Traffic anomaly (unusual spike on a specific endpoint)

## First checks (< 2 minutes)
1. Open overview dashboard: https://grafana.internal/d/orders-overview
2. Look at deploy annotations in the last 30 minutes
3. Look at error rate by endpoint: which routes are failing?
4. Look at error rate by status: is it 500, 502, 503?

## Remediation
- If a deploy in the last 30 minutes correlates: roll back.
- If a dependency is failing: page that team or use its fallback path.
- If a specific endpoint is failing: feature-flag it off if possible.

## Escalation
- If unresolved in 15 minutes, escalate to team lead.
- If data-loss is suspected, escalate to incident commander immediately.
```
One page. Saves hours of on-call stress.
Alert testing
You cannot be sure an alert works until it has fired at least once. Two strategies:
1. Chaos-test your alerts
Intentionally induce the failure mode. Artificial 5xx, induced latency, CPU pressure. Confirm the alert fires and routes correctly.
2. Sanity-check routing
Once a quarter, send a synthetic test alert per rotation. Confirm it reaches the right people, the paging tool is working, and the runbook link is not 404.
A dead alerting pipeline is worse than no alerting — because you think you are covered. Synthetic tests are cheap insurance.
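One cheap way to test the pipeline continuously is a dead man's switch: an alert that is always firing by design, routed to an external heartbeat monitor, so that its *silence* signals a broken pipeline. The kube-prometheus stack ships this pattern as a `Watchdog` alert:

```yaml
# Dead man's switch: always firing. Route it to a heartbeat service;
# if the heartbeat stops arriving, the alerting pipeline itself is broken.
- alert: Watchdog
  expr: vector(1)
  labels:
    severity: none
  annotations:
    summary: "Heartbeat alert; absence of this alert means the pipeline is down"
```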
Quiz
An engineer proposes adding an alert that fires whenever pod CPU usage exceeds 85 percent for 5 minutes. What is the best response?
What to take away
- Page only for user-visible, urgent, actionable problems. If any of those three is missing, downgrade or delete.
- Alert on symptoms (SLO burn, error rate, latency), not causes (CPU, memory, queue depth).
- Use multi-window, multi-burn-rate alerts for SLOs. Three alerts per SLO — fast-burn page, slow-burn page, long-tail ticket.
- Every paging alert needs a runbook with "what it means / usual causes / what to check / how to remediate."
- Target: fewer than 5 pages per week per service. More than that is alert fatigue.
- Merge redundant alerts. Group correlated alerts. Silence maintenance windows. Test alert routing quarterly.
Next lesson: putting it all together — the metrics → logs → traces debugging flow during a real incident.