The Four Golden Signals
A new team takes over a service. They inherit 80 metrics, 30 dashboards, and 40 alerts. On-call rotations are miserable; most alerts are ignored. The new tech lead deletes everything and starts from scratch with four metrics: latency, traffic, errors, saturation. Within a month, on-call load is down 70 percent, incidents are detected faster, and the dashboards are readable. Google's Four Golden Signals — from the SRE book — have become the default answer to "what should I measure for a service?" because they focus on what users actually experience and what operators actually need.
This lesson covers each signal, how to measure it correctly, and the common mistakes that turn a golden-signal dashboard into noise. These four are the starting point for every service you operate. Master them before adding anything else.
The Four Signals
| Signal | What it measures | Example metric |
|---|---|---|
| Latency | How long a request takes | p50/p95/p99 HTTP request duration |
| Traffic | How much demand the system is experiencing | requests per second |
| Errors | Rate of failing requests | 5xx error rate, percentage failed |
| Saturation | How full the system is | CPU %, queue depth, memory used |
Any service you operate should have a dashboard answering these four questions within the first 5 seconds of looking at it. Everything else is secondary.
The Four Golden Signals come from Google's SRE book and are the single most-cited observability framework. They focus on user-facing impact (latency, errors) and operator-actionable signals (traffic, saturation). For most services, if you cover these four well and alert on the first two (symptom alerts), you have the minimum viable observability. Everything else — custom metrics, specific dimensions, deep instrumentation — is additive.
Latency
How long does a request take? Measured in milliseconds or seconds.
Percentiles, not averages
Average latency is almost always the wrong metric. Averages hide long tails. The user experience is dominated by the slow requests, not the mean.
Use percentiles:
- p50 (median): half of requests are faster than this. A sanity check — if p50 is high, everything is slow.
- p95: 95% of requests are faster than this. The "typical worst-case" experience.
- p99: 99% of requests are faster than this. Heavy tail; where most customer complaints originate.
- p99.9: 999 of 1,000 requests are faster than this. Truly rare, but it matters for high-volume services.
A service with a mean of 150 ms might have a p99 of 2,000 ms: 1% of requests take 2 seconds or longer. The mean hid the problem.
```promql
# Prometheus PromQL: p99 latency over 5 minutes
histogram_quantile(0.99,
  rate(http_request_duration_seconds_bucket{service="api"}[5m]))

# p95 per endpoint
histogram_quantile(0.95,
  sum by (endpoint, le) (
    rate(http_request_duration_seconds_bucket{service="api"}[5m])
  )
)
```
Separate successful and failed request latency
A service that fails fast (100ms 5xx response) looks "faster" than one that succeeds slowly. But failures are bad. Split latency by status code:
```promql
# Latency of SUCCESSFUL requests only
histogram_quantile(0.99,
  rate(http_request_duration_seconds_bucket{status=~"2..|3.."}[5m]))
```
Alert on successful-request latency. Failure latency is a different signal (errors, not latency).
Traffic
How much demand is the system handling?
For HTTP services: requests per second. For workers: jobs per second. For databases: queries per second. For pub/sub: messages per second.
```promql
# Requests per second, summed across instances
sum(rate(http_requests_total{service="api"}[5m]))

# Per endpoint
sum by (endpoint) (
  rate(http_requests_total{service="api"}[5m])
)
```
Why traffic matters
Traffic by itself is not an "alert when high" signal — high traffic is business success. But it is the denominator for error rate ("percentage of requests that fail") and the contextual signal for capacity planning.
Traffic patterns also reveal correlations:
- "Latency spiked — did traffic spike too?" → capacity issue, not code.
- "Latency spiked but traffic is steady?" → code or dependency issue.
- "Traffic dropped suddenly?" → front-end break, upstream outage, or DDoS mitigation kicking in.
Alert on traffic drops
Sudden traffic drops often mean "something is broken upstream and nobody is reaching us." Alert on `rate(http_requests_total) < historical_baseline`:
```promql
# Traffic dropped > 50% vs 1 hour ago
sum(rate(http_requests_total[5m]))
  < 0.5 * sum(rate(http_requests_total[5m] offset 1h))
```
Gotcha: this false-fires at night or on weekends, when traffic is legitimately lower. Use anomaly detection or day-of-week-aware baselines in production.
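One common mitigation, as a sketch: compare against the same time last week instead of one hour ago, so the weekday/weekend rhythm is built into the baseline (`offset 1w` is standard PromQL; the 50% threshold is an assumption to tune):

```promql
# Traffic dropped > 50% vs the same time last week
sum(rate(http_requests_total[5m]))
  < 0.5 * sum(rate(http_requests_total[5m] offset 1w))
```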
Errors
Rate of failing requests.
```promql
# Raw error count per second
sum(rate(http_requests_total{status=~"5.."}[5m]))

# As a RATIO (errors / total) — much better
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
```
Alert on ratio, not absolute
10 errors per second is bad for a service handling 100 requests per second (10% error rate). 10 errors per second is meaningless for a service handling 100,000 requests per second (0.01% error rate).
Alert on ratio or rate as a fraction of total, with thresholds aligned to your SLO (Module 5). A typical alert:
"Error rate over 5 minutes > 1% AND total request rate > 10/s"
The second condition avoids false alarms on near-zero traffic (a single failed healthcheck when everything else is idle).
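That compound condition can be written directly in PromQL with `and` (thresholds here are illustrative; align them with your own SLO):

```promql
# Page only when the error ratio is bad AND there is real traffic
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) > 0.01
and
sum(rate(http_requests_total[5m])) > 10
```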
What counts as an "error"?
Not just HTTP 5xx. Depending on the service:
- HTTP 5xx: server errors. Always an error.
- HTTP 4xx: client errors. Not your fault, but sometimes a signal (a 4xx spike can mean a broken client, a contract change, or API misuse).
- Policy violations: 429 rate-limited, 403 forbidden. Expected; not usually errors.
- Custom business errors: "order validation failed" — you decide.
- Timeouts: yes, errors.
- Partial failures: e.g., returning cached data because the DB is slow. "Degraded" success — track as its own signal.
Be explicit about what counts in each service. A `service_errors_total` counter with a `type` label lets you break down later.
Saturation
How close to full is the system?
Saturation is a resource-level signal: CPU, memory, disk I/O, network bandwidth, queue depth, connection pool usage. Unlike traffic (a request-rate measurement), saturation is "how much capacity is consumed?"
Measure saturation as a percentage
```promql
# CPU saturation (cgroup-aware for containers)
sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))
/
sum by (pod) (kube_pod_container_resource_limits{resource="cpu"})

# Memory saturation
container_memory_working_set_bytes / container_spec_memory_limit_bytes

# Queue depth (for worker pools, message brokers)
queue_depth{queue="ingest"}

# Connection pool usage
db_pool_connections_active / db_pool_connections_max
```
Not every service has one obvious saturation signal — you have to know the resource that constrains YOUR service. For a web service, it might be:
- CPU (for compute-heavy workloads).
- Thread/worker pool (for synchronous Python/Ruby apps with a fixed worker count).
- DB connection pool (for most database-backed apps).
- Memory (for caching services).
Saturation predicts problems
Saturation tends to move before latency or errors do. When CPU hits 85%, latency is usually still fine but about to spike. An alert on saturation gives operators time to act (scale out, enable degradation) before users feel it.
But saturation alerts are not customer-facing — a saturated system is not necessarily broken. For true SLO-grade alerts, use latency/errors. Use saturation for capacity-planning signals.
Applying the Signals to Different Service Types
The Four Golden Signals work across service types, but the specifics vary:
| Service type | Latency | Traffic | Errors | Saturation |
|---|---|---|---|---|
| HTTP API | p95/p99 request duration | req/s | 5xx ratio | CPU / worker pool |
| Worker / job runner | job duration | jobs/s | failed-jobs ratio | queue depth |
| Database | query duration | queries/s | failed queries | connections used / IO await |
| Message queue | end-to-end lag | msg/s | dead-letter rate | queue size |
| Cache | hit latency | req/s | errors / misses | memory used |
| Load balancer | upstream latency | conn/s | 5xx | connection count |
For a new service, filling out this table is the "golden signals for my service" exercise. Do it once; set up the dashboard; alert on the first two.
The Four Signals on One Dashboard
Keep the at-a-glance view to four panels, one per signal: latency percentiles, request rate, error ratio, and saturation. Detail panels can live below, but these four should answer "is the service healthy?" without scrolling.
Golden Signals vs the USE Method
You may also encounter the USE method (Utilization, Saturation, Errors), from Brendan Gregg's Linux performance work. It is the resource-side equivalent of golden signals:
| Framework | Focus | Best for |
|---|---|---|
| Four Golden Signals | Service-side (user-facing) | Web services, APIs, workers |
| USE (Utilization, Saturation, Errors) | Resource-side (host-side) | CPU, disk, network, kernel resources |
| RED (Rate, Errors, Duration) | Request-level | Microservices (Prometheus community's variant) |
Most teams use a blend: golden signals for service health, USE for host-level bottlenecks, RED per-service. They are not competing — they are complementary views.
Symptom Alerts vs Cause Alerts
A critical nuance. Your alerts should target symptoms (user-facing issues) not causes (machine-level metrics).
- Symptom alert (good): "Error rate for checkout > 1% for 10 minutes." This is what users experience.
- Cause alert (bad): "CPU > 90% on pod-abc123." This may or may not matter. Page engineers only if users are affected.
Golden signals are symptoms. Saturation is the gray zone — saturation can be a leading indicator of a symptom, but it is not one itself. Alert on latency + error ratio; keep saturation as a dashboard signal and a predictive hint.
Ratio over count, always. An alert on "5xx errors > 100/s" fires when traffic doubles even if the error RATE stays constant. An alert on "5xx / total > 1%" stays correct regardless of traffic. Almost every alert should be normalized this way.
Common Traps
Averages for latency
Already covered — always use percentiles (p95/p99), never means.
Latency measured client-side vs server-side
If your service sits behind a load balancer or proxy, server-side latency misses queue wait time. Client-facing latency (real user monitoring (RUM), or measurement at your edge) is closer to what users experience. For pure server health, both matter.
Counting 2xx only for "success rate"
Some clients rely on 3xx redirects. Some business logic returns 2xx with a `"status": "failed"` body. In either case, HTTP status codes lie about success. Define success explicitly:
```promql
# Count business-level errors too
sum(rate(http_requests_total{status!~"2..|3.."}[5m]))
+
sum(rate(application_errors_total[5m]))
```
Conflating saturation levels
CPU at 50% on an 8-core container can mean:
- Workload uses 4 cores evenly: fine.
- Workload is single-threaded and pegged at 100% of one core: already saturated.
Drill down by process, thread, or `cpu.stat` (cgroup v2) to get the real picture. Raw aggregate % can mislead.
No saturation for queues
A worker service with a growing queue depth is saturated even if the workers themselves are idle. Always track queue depth as a separate saturation signal.
A Practical Implementation
Instrument HTTP handlers in almost any language with a Prometheus client library:
```go
// Go (using prometheus/client_golang)
import (
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	requestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "http_requests_total",
	}, []string{"method", "endpoint", "status"})
	requestDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Buckets: []float64{0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10},
	}, []string{"method", "endpoint", "status"})
)

// statusRecorder captures the status code the handler writes.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

func metricsMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: 200}
		next.ServeHTTP(rec, r)
		status := strconv.Itoa(rec.status)
		// In production, prefer the matched route pattern over r.URL.Path
		// to keep label cardinality bounded.
		requestsTotal.WithLabelValues(r.Method, r.URL.Path, status).Inc()
		requestDuration.WithLabelValues(r.Method, r.URL.Path, status).
			Observe(time.Since(start).Seconds())
	})
}
```
That gives you traffic (the counter), latency (the histogram), and errors (status label). Saturation comes from runtime metrics (GC pause time, goroutine count) or the container runtime (cgroup stats).
Key Concepts Summary
- Four Golden Signals: Latency, Traffic, Errors, Saturation. Start here for any service.
- Latency should be percentiles, not averages. p50/p95/p99.
- Errors should be rates as a ratio of total, not absolute counts.
- Traffic is context — it divides errors and predicts saturation.
- Saturation is resource-specific — CPU, memory, worker pool, queue depth. Know your bottleneck.
- Separate success and failure latency — failures often respond fast, skewing numbers.
- Alert on symptoms (latency, errors), not causes (CPU spikes).
- USE method (Utilization, Saturation, Errors) is the resource-side complement; RED (Rate, Errors, Duration) is the request-side variant.
- A golden-signals dashboard is 4 panels. Every service should have one.
Common Mistakes
- Using average latency. Almost always misleading.
- Alerting on absolute error counts. Breaks at high or low traffic.
- Not splitting latency by success vs failure. Failure fast-fails skew "successful" metrics.
- Ignoring traffic drops. They often precede incidents (upstream broken).
- Saturation alerts paging engineers. Saturation is predictive, not a page; wait for symptoms.
- Measuring only what your service does, ignoring the resources (worker pool, queue depth) that constrain it.
- One dashboard with 80 panels. Use a 4-panel golden-signals dashboard as the "at-a-glance" view; detail panels go below.
- Treating 4xx and 5xx the same. 4xx is client error; 5xx is server error. Only 5xx usually goes in your error-rate metric.
- Forgetting business errors. HTTP 200 does not always mean success. Instrument domain-level success/failure too.
- Skipping the exercise of writing down the four signals per service. A half-hour whiteboard session pays back across every incident for years.
Your team's API runs on 20 pods at an average CPU of 30 percent per pod. A dashboard shows p50 latency at 80 ms and mean latency at 180 ms. An engineer argues the system is healthy: average latency is reasonable and CPU has headroom. A user complains about a 3-second page load. What are you missing from the four-golden-signals view, and what is probably happening?