Alert Design
Good alerting is the single biggest quality-of-life lever in an on-call rotation. Bad alerting — noisy, wrong-priority, poorly targeted — is why engineers burn out and leave teams.
The worst alerting setups share the same pathology: hundreds of alerts, most of them firing regularly, most of them irrelevant. Engineers learn to ignore alerts. Then the one alert that matters gets ignored too.
This lesson is about how to design alerts that actually page you when things matter and stay quiet when they don't.
Every page should be actionable, urgent, and real. If any of those three is missing, the alert is either misconfigured or should not exist as a page. The right page count per week per service is small — single digits.
The only good reason to page someone
An alert should page a human being if and only if:
- Something is happening now that affects users.
- A human is required to fix it (not a retry, not autoscaling, not a self-healing mechanism).
- It is urgent — waiting until morning would materially worsen the outcome.
If any of these is missing, you have either a ticket (urgent but not page-worthy) or a noise problem.
The page-worthy failure modes
- Service is down or severely degraded (SLO burning fast).
- Data is being lost or corrupted.
- A security incident is in progress.
- A dependency has failed in a way the service cannot auto-recover from.
That is approximately the whole list.
Alert on symptoms, not causes
The most important rule in alerting:
Alert on what the user sees, not on what is happening inside your system. User-visible errors. User-visible latency. Data that should be current but isn't. Alerts on internal causes (CPU, memory, queue depth) generate noise because they fire even when the system is handling the issue.
The failure mode of cause-based alerts
- CPU hits 95%. You page the on-call.
- On-call wakes up. Looks at the dashboard. Service is serving traffic fine. Latency is normal. Error rate is zero.
- On-call goes back to sleep, annoyed.
- The alert is tuned upward (threshold raised to 98%, or a 5-minute window added). Then it either fires again anyway, or stays quiet the one time it mattered.
CPU being high is a cause. Sometimes it leads to a user-visible problem and sometimes it does not. Paging on it means you page whenever the cause could, theoretically, be relevant.
Symptom-based alerts
- Error rate for the service > 1% for 5+ minutes → user requests are failing → page.
- p99 latency > 1s for 10+ minutes → users are seeing slow responses → page.
- Pipeline lag > 30 minutes → freshness SLO violated → page.
These fire when users are actually affected. CPU can be at 95% all day and not trigger. High queue depth can happen and not trigger — as long as users are getting served.
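As a sketch, the first of those symptom alerts might look like this as a Prometheus rule. The metric and label names (`http_requests_total`, `service`, `status`) are illustrative assumptions, not from this lesson:

```yaml
# Hypothetical symptom alert: page when >1% of user requests fail for 5+ minutes.
# Metric and label names are illustrative; adapt to your instrumentation.
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
      /
    sum(rate(http_requests_total[5m])) by (service)
    > 0.01
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "Error rate above 1% for {{ $labels.service }}"
```

Note that nothing here mentions CPU, memory, or queues: the rule fires only when users are actually failing.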
The taxonomy: page, ticket, nothing
Every potential alert falls into one of three buckets:

- PAGE: wake a human now. User-visible, urgent, and requires a person to fix.
- TICKET: a human must act, but it can wait for working hours (cert expiring in 14 days, error budget draining slowly, disk filling over days).
- NOTHING: a dashboard panel or a log line. Useful for debugging, not worth interrupting anyone.

The mistake most teams make is putting everything in the PAGE bucket "just in case." The fix is to deliberately categorize each alert and move most of them down a tier.
Multi-window, multi-burn-rate alerts (revisited)
From the SLO module — the standard production pattern:
```yaml
- alert: SLOFastBurn
  expr: |
    (
      (1 - availability_sli_1h) / (1 - 0.999) > 14
      AND
      (1 - availability_sli_5m) / (1 - 0.999) > 14
    )
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "Fast burn of availability SLO for {{ $labels.service }}"
    runbook: "https://runbooks.company.com/slo-burn"

- alert: SLOSlowBurn
  expr: |
    (
      (1 - availability_sli_6h) / (1 - 0.999) > 6
      AND
      (1 - availability_sli_30m) / (1 - 0.999) > 6
    )
  for: 15m
  labels:
    severity: page
  annotations:
    summary: "Sustained burn of availability SLO for {{ $labels.service }}"

- alert: SLOTicketBurn
  expr: |
    (
      (1 - availability_sli_24h) / (1 - 0.999) > 3
      AND
      (1 - availability_sli_2h) / (1 - 0.999) > 3
    )
  for: 1h
  labels:
    severity: ticket
```
Three alerts per SLO: one for acute outages, one for sustained degradation, one for long-tail drain. That is the whole alerting surface for the availability SLO. Do the same for latency. That is ~6 alerts per service, total, for the SLOs.
Tools like Pyrra, Sloth, and OpenSLO generate these rules automatically from a simple SLO definition. Do not write them by hand unless you understand exactly what they do — the math is subtle and getting it wrong produces either missed pages or endless noise.
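For a sense of what those tools consume, here is a hedged sketch of a Sloth-style SLO definition that would generate the multi-window, multi-burn-rate rules above. The schema follows Sloth's `prometheus/v1` format as I understand it (check the current Sloth docs before relying on field names), and the queries are illustrative:

```yaml
# Sketch of a Sloth SLO definition; field names per Sloth's prometheus/v1
# schema, queries illustrative.
version: "prometheus/v1"
service: "orders-service"
slos:
  - name: "availability"
    objective: 99.9
    sli:
      events:
        error_query: sum(rate(http_requests_total{job="orders-service",status=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total{job="orders-service"}[{{.window}}]))
    alerting:
      name: OrdersAvailabilityBurn
      page_alert:
        labels:
          severity: page
      ticket_alert:
        labels:
          severity: ticket
```

From roughly a dozen lines of definition, the generator emits the full set of burn-rate windows with the math already correct.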
The alert anatomy — what every alert needs
An alert that pages you should arrive with enough information to act. Minimum payload:
```
SEVERITY: page / ticket
SERVICE: orders-service
SUMMARY: Error rate is 5% (threshold 1%) for 3 minutes
DASHBOARD: <link to the overview dashboard>
RUNBOOK: <link to the specific runbook for this alert>
RECENT DEPLOYS: <automated query of deploys in last 30m>
RELATED TRACES: <link to tail-sampled slow/error traces>
```
The dashboard link, the runbook link, and "recent deploys" are the three fields that save the most time. An on-call engineer should be able to start diagnosing within 30 seconds of the page, not 3 minutes after they locate everything.
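In Prometheus terms, most of that payload travels as rule annotations. A sketch (the URLs are placeholders; `humanizePercentage` is a real Prometheus template function):

```yaml
  annotations:
    summary: "Error rate is {{ $value | humanizePercentage }} (threshold 1%)"
    dashboard: "https://grafana.internal/d/orders-overview"
    runbook: "https://runbooks.company.com/orders-high-error-rate"
    # "Recent deploys" and "related traces" are typically injected by the
    # paging tool or a notification webhook, not by Prometheus itself.
```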
The runbook matters
Every paging alert should have a runbook. The runbook does not have to be long — one page, with:
- What this alert means (in plain language).
- What the usual causes are.
- What to check first (linked dashboard panels).
- What to do (remediation steps, commands, escalations).
- When to escalate.
If there is no runbook, the alert degrades over time: people misremember what it means, and the same investigation is repeated from scratch on every page.
Avoiding alert fatigue
If you are paged more than a few times a week per service, the alerts are tuned wrong. Fatigue sets in. People stop reading pages carefully. The signal disappears.
Fixes:
1. Audit the last 90 days of pages
For each page, ask: "was this actionable? was it urgent? was it real?" If the answer to any is no, the alert is either miscategorized or should not exist.
2. Merge redundant alerts
"API error rate" and "API 5xx rate" and "API 500 rate" are three versions of the same alert. Merge them into one.
3. Suppress during known-maintenance
Planned work should not page. Your alerting tool should let you silence the relevant alerts during a maintenance window.
4. Add for: durations
A 30-second blip is often self-healing. Add for: 5m to any alert where a brief spike is normal. Do not set it so long that slow disasters slip through.
5. Group by service / cluster
If 50 alerts fire at once because one cluster is in trouble, collapse them into one notification. Alertmanager's group_by does this.
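The grouping fix maps directly onto Alertmanager's routing config. A minimal sketch, assuming a PagerDuty receiver (the receiver name and key are placeholders):

```yaml
# Alertmanager routing sketch: collapse correlated alerts into one notification.
route:
  group_by: ["service", "cluster"]
  group_wait: 30s        # wait briefly so related alerts batch into one page
  group_interval: 5m     # batch further alerts arriving in the same group
  repeat_interval: 4h    # re-notify only if still firing after 4 hours
  receiver: oncall-pager
receivers:
  - name: oncall-pager
    pagerduty_configs:
      - routing_key: "<redacted>"
```

With this in place, a cluster-wide failure produces one notification listing its members, not fifty separate pages.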
Do not set "for:" so long that it becomes cover. for: 30m on an SLO burn alert means you will not know about an outage for 30 minutes. The SLO burn rate itself already bakes in some duration; do not stack another 30 minutes on top.
Special alerts worth having
Beyond SLO burn, a few domain alerts are universally useful:
1. Absence alerts — "metric stopped reporting"
absent_over_time(up{job="orders-service"}[5m])
If the service stops scraping entirely, every other alert is silent too — because they have no data to fire on. Catch this explicitly.
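A full rule for this, following the same shape as the others (`up{job="orders-service"}` assumes the standard Prometheus `up` metric and an illustrative job name):

```yaml
# Absence alert: fires if the service has reported no 'up' samples for 5 minutes.
- alert: MetricsAbsent
  expr: absent_over_time(up{job="orders-service"}[5m])
  labels:
    severity: page
  annotations:
    summary: "No metrics received from orders-service for 5 minutes"
```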
2. Cert expiry
(probe_ssl_earliest_cert_expiry - time()) / 86400 < 14
14 days out, create a ticket. 3 days out, page.
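Expressed as two rules over the blackbox exporter's `probe_ssl_earliest_cert_expiry` metric, one per tier:

```yaml
# Ticket two weeks out, page three days out.
- alert: CertExpirySoon
  expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 14
  labels:
    severity: ticket
  annotations:
    summary: "TLS cert for {{ $labels.instance }} expires in under 14 days"

- alert: CertExpiryImminent
  expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 3
  labels:
    severity: page
  annotations:
    summary: "TLS cert for {{ $labels.instance }} expires in under 3 days"
```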
3. Replication / failover lag
For stateful systems, replication lag is worth alerting on independently of SLO burn, because it signals risk of data loss during a failover.
4. Disk / storage
Not on utilization (cause) but on projected time to full (symptom of how much time you have to act).
predict_linear(disk_free_bytes[6h], 3600*24) < 0
Translation: at the current rate of fill, disk will be full within 24 hours.
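Wrapped into a rule, with a `for:` duration so one anomalous write burst does not fire it (`disk_free_bytes` is the metric name used above; on node_exporter the equivalent is `node_filesystem_free_bytes`):

```yaml
# Projected-time-to-full alert: extrapolate the last 6h of fill rate 24h forward.
- alert: DiskFullWithin24h
  expr: predict_linear(disk_free_bytes[6h], 24 * 3600) < 0
  for: 30m
  labels:
    severity: ticket
  annotations:
    summary: "{{ $labels.instance }} projected to run out of disk within 24 hours"
```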
Anti-patterns in alerting
1. Alerting on everything
Every metric is not an alert. Most metrics are dashboards. A few are alerts.
2. "The same alert but with a higher threshold"
If an alert fires noisily, the first instinct is to raise the threshold. This makes it less noisy but also less useful. The right question is usually: "is this the right alert at all?"
3. Alerting on derivatives
"Request rate dropped 20% in the last 5 minutes" sounds like a good alert. In practice, request rates drop 20% for ten non-outage reasons — marketing campaign ended, a large customer turned off a feature, a cron job finished. Derivative alerts have terrible precision.
4. Alerts with no owner
Every alert should be owned by a service team. If an alert fires and no one knows whose problem it is, it should not exist.
5. Alerts without runbooks
See above. Undocumented alerts rot.
Mini runbook template
Every alert should have one of these:
```markdown
# Alert: SLOFastBurn — orders-service availability

## What it means
The error rate on orders-service is burning the availability SLO at 14x the normal rate.
If it continues, we will exhaust the 28-day error budget in 2 days.

## Usual causes
- Recent deploy introduced a regression (check annotations on the overview dashboard)
- Downstream dependency is failing (check orders-service -> database + recommendation-service traces)
- Config change affecting a specific path
- Traffic anomaly (unusual spike on a specific endpoint)

## First checks (< 2 minutes)
1. Open overview dashboard: https://grafana.internal/d/orders-overview
2. Look at deploy annotations in the last 30 minutes
3. Look at error rate by endpoint: which routes are failing?
4. Look at error rate by status: is it 500, 502, 503?

## Remediation
- If a deploy in the last 30 minutes correlates: roll back.
- If a dependency is failing: page that team or use its fallback path.
- If a specific endpoint is failing: feature-flag it off if possible.

## Escalation
- If unresolved in 15 minutes, escalate to team lead.
- If data-loss is suspected, escalate to incident commander immediately.
```
One page. Saves hours of on-call stress.
Alert testing
You cannot be sure an alert works until it has fired at least once. Two strategies:
1. Chaos-test your alerts
Intentionally induce the failure mode. Artificial 5xx, induced latency, CPU pressure. Confirm the alert fires and routes correctly.
2. Sanity-check routing
Once a quarter, send a synthetic test alert per rotation. Confirm it reaches the right people, the paging tool is working, and the runbook link is not 404.
A dead alerting pipeline is worse than no alerting — because you think you are covered. Synthetic tests are cheap insurance.
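One cheap way to test the pipeline continuously is a dead man's switch: an alert that is always firing by design, routed to an external heartbeat monitor, so that its *silence* signals a broken pipeline. The kube-prometheus stack ships this pattern as a `Watchdog` alert:

```yaml
# Dead man's switch: always firing. Route it to a heartbeat service;
# if the heartbeat stops arriving, the alerting pipeline itself is broken.
- alert: Watchdog
  expr: vector(1)
  labels:
    severity: none
  annotations:
    summary: "Heartbeat alert; absence of this alert means the pipeline is down"
```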
Quiz
An engineer proposes adding an alert that fires whenever pod CPU usage exceeds 85 percent for 5 minutes. What is the best response?
What to take away
- Page only for user-visible, urgent, actionable problems. If any of those three is missing, downgrade or delete.
- Alert on symptoms (SLO burn, error rate, latency), not causes (CPU, memory, queue depth).
- Use multi-window, multi-burn-rate alerts for SLOs. Three alerts per SLO — fast-burn page, slow-burn page, long-tail ticket.
- Every paging alert needs a runbook with "what it means / usual causes / what to check / how to remediate."
- Target: fewer than 5 pages per week per service. More than that is alert fatigue.
- Merge redundant alerts. Group correlated alerts. Silence maintenance windows. Test alert routing quarterly.
Next lesson: putting it all together — the metrics → logs → traces debugging flow during a real incident.