Observability Fundamentals for Engineers

Grafana Dashboards That Do Not Suck

Every team builds too many dashboards. Most of those dashboards get used exactly twice: once by the engineer who built them, and once by whoever inherits the service and cannot delete them because "maybe someone still looks at it."

The worst offenders are the 50-panel "everything about the service" dashboards that try to show every signal and end up showing no useful signal at all. During an incident, on-call engineers scroll past them to get to the three panels they actually trust.

This lesson is about the small number of dashboards that do get used, and the design rules that make them work.

KEY CONCEPT

A dashboard has exactly one purpose: to answer a specific question. If you cannot write that question in one sentence, the dashboard is not designed — it is accumulated. Accumulated dashboards do not work during incidents.


The three dashboards every service needs

For any production service, three dashboards will cover 95% of usage:

1. OVERVIEW — "Is the service OK right now?"

  • SLO compliance (gauge)
  • Request rate (graph)
  • Error rate (graph)
  • p95 / p99 latency (graph)
  • Saturation (CPU, memory)

5-8 panels max, reviewed at a glance. The first thing to check when you get paged.

2. DETAIL — "If not OK, where is the problem?"

  • Rate / errors by endpoint
  • Latency by endpoint
  • Downstream dependency health
  • DB query stats
  • Cache hit / miss

10-15 panels, organized by subsystem. Linked from the overview panel where the anomaly is seen.

3. INTERNAL — "What is the service doing?"

  • Goroutine / thread count
  • Heap size / GC pauses
  • Connection pools
  • Queue depths
  • Cron / job runs

15-25 panels, used for deep debugging. Service owners use it; it is not for incident on-call.

These three are the dashboard inventory. Everything else — "capacity planning," "per-customer," "experimentation" — is a special-case dashboard that lives outside this core.


The "overview dashboard" rules

The overview dashboard is the one you look at first when paged. It has to answer "is the service OK?" in 10 seconds.

Rules:

  1. Fits on one screen without scrolling. If you have to scroll, the dashboard is wrong.
  2. The SLO is always the top-left panel. Anchor the first thing the eye sees.
  3. Every panel answers a single question. Never put two unrelated series on one graph.
  4. Time range defaults to last 1-6 hours. Incident time-scale, not capacity-planning time-scale.
  5. Every panel links to a more detailed view (the detail dashboard, a log query, a trace search).

The canonical overview layout:

┌───────────────────────┬───────────────────────┐
│ SLO compliance        │ Error budget left     │
│ 99.93% / 99.9%        │ 58% of 28-day         │
├───────────────────────┼───────────────────────┤
│ Request rate (1h)     │ Error rate (1h)       │
├───────────────────────┼───────────────────────┤
│ p95 latency (1h)      │ p99 latency (1h)      │
├───────────────────────┴───────────────────────┤
│ Pod saturation (CPU + memory)                 │
└───────────────────────────────────────────────┘

Seven panels. Three rows of two, plus one full-width saturation row. One answer: "yes the service is fine" or "no it's not, here's where to look next."


What NOT to put on a dashboard

Things engineers keep adding that make dashboards worse:

1. Every alert's underlying metric

If you have 40 alerts, do not add 40 panels. The alerts themselves are the signal. The dashboard should show the 5-8 overall health indicators, not every granular detail.

2. "Interesting" metrics nobody uses

"Number of requests by HTTP method" is interesting exactly once. After that it is visual noise. If you would not look at this panel in an incident, cut it.

3. Raw counters

http_requests_total

A counter climbs forever. Looking at the raw value tells you nothing. Always show rate().
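For example, using the same counter:

```promql
# BAD — raw counter value, monotonically increasing
http_requests_total

# GOOD — per-second rate over a 5-minute window
sum(rate(http_requests_total[5m]))
```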

4. Averages when you mean percentiles

# BAD — masks the slow tail
avg(http_request_duration_seconds)

# GOOD — shows the actual user experience
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

5. Ambiguous axes

If a panel shows "errors per hour" and another shows "errors per second," someone will mis-compare them at 3am. Be consistent — rate(...[5m]) everywhere unless there is a very specific reason not to.


The panel-design rules

1. Title is a question or fact

# BAD
"HTTP requests"

# GOOD
"Request rate by endpoint"

The title tells the reader what they are looking at in 3 seconds.

2. Units are set explicitly

Grafana will happily show you a y-axis labeled "1.2M" without telling you if that is bytes, requests, or seconds. Always set the unit. Always.
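As a sketch, in Grafana's dashboard JSON the unit lives in the panel's field config; `reqps` (requests per second) is one of Grafana's built-in unit IDs:

```json
{
  "title": "Request rate by endpoint",
  "type": "timeseries",
  "fieldConfig": {
    "defaults": {
      "unit": "reqps"
    }
  }
}
```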

3. Thresholds are visible

If there is a threshold that matters ("p99 < 500ms is the SLI target"), draw a horizontal line on the graph. It makes the SLO tangible.

4. Legend is minimal

If your legend has 50 entries, the graph is useless. Use topk(5, ...) or aggregate. If you genuinely need 50 entries, it is a table, not a graph.
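A sketch, assuming the request counter carries an `endpoint` label:

```promql
# Legend stays readable: only the 5 busiest endpoints
topk(5, sum by (endpoint) (rate(http_requests_total[5m])))
```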

5. Color is semantic

Red = bad. Yellow = warning. Green = good. Gray = neutral / informational. Do not use rainbow palettes on error panels. Do not paint everything red to make it "look alarming."

PRO TIP

Grafana's default color palette is designed for exploration, not incidents. Pick 3-4 colors for your team and use them consistently across dashboards.


Using recording rules and variables

Two Grafana tools that dramatically improve dashboards:

Recording rules for expensive queries

Any query with histogram_quantile across many labels belongs in a recording rule. Dashboards should be fast to load. If a dashboard takes 5 seconds to render, engineers stop looking at it.

# Prometheus recording rule
groups:
  - name: slo_http_p99
    interval: 30s
    rules:
      - record: job:http_request_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum by (le, job) (rate(http_request_duration_seconds_bucket[5m]))
          )

Dashboard query becomes job:http_request_duration_seconds:p99 — fast and readable.

Template variables for reusable dashboards

$service — from label_values(http_requests_total, service)
$environment — from label_values(http_requests_total{service=$service}, environment)

Now the same dashboard works for every service. You pick the service from a dropdown instead of having one dashboard per service.
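A panel query then references both variables; a sketch using the request counter from earlier:

```promql
sum(rate(http_requests_total{service="$service", environment="$environment"}[5m]))
```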


USE and RED as a starting template

For any service, use these quick-start templates:

RED for request-driven services

  • Rate: sum(rate(http_requests_total[5m]))
  • Errors: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
  • Duration: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

Three panels that cover the core of any service dashboard.

USE for resources (queues, pools, workers)

  • Utilization: fraction of resource in use
  • Saturation: queue depth / wait time
  • Errors: failures attributable to resource exhaustion

Three panels for each resource.
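As a sketch for a DB connection pool — the metric names here are hypothetical; substitute whatever your client library actually exports:

```promql
# Utilization — fraction of the pool in use (hypothetical metric names)
db_pool_connections_in_use / db_pool_connections_max

# Saturation — callers queued waiting for a connection
db_pool_wait_queue_depth

# Errors — acquisitions that failed or timed out
rate(db_pool_acquire_timeouts_total[5m])
```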

Start with RED + USE. Expand only when there is a specific question you keep needing to answer that neither covers.


Annotations — the under-used feature

Annotations put vertical lines on graphs showing when things happened. Good annotation sources:

  • Deploys — from your CD pipeline, automatically.
  • Feature flag changes — if you use LaunchDarkly / similar, emit an annotation on flag flip.
  • Infra events — new node added, autoscaling event, cluster upgrade.
  • Incidents — when the on-call tool fires.

# Grafana annotation query
SELECT event_time AS "time", event_type, details
FROM deploy_events
WHERE $__timeFilter(event_time) AND service = '$service'

When you see a latency spike, it is immediately obvious whether it correlates with a deploy. That one feature saves hours of investigation.

KEY CONCEPT

Every dashboard should have deploy annotations. It is the cheapest possible piece of context, and it answers the first question ("did we just deploy something?") before you have to ask it.


Dashboards as code

Click-built dashboards drift. Teams end up with 10 copies of the same panel with subtle differences, and no one knows which is the "real" one.

The fix: dashboards as code.

  • Grafana Terraform provider — declare dashboards in HCL, check into git, review via PR.
  • Grafonnet (Jsonnet) — write dashboards as functions; generate JSON.
  • grafanalib (Python) — similar, in Python.
  • Just store the JSON in git and sync it via a CI job — the lowest-effort version.
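A minimal sketch with the Terraform provider — the `grafana_dashboard` resource type is real; the file path and dashboard name are illustrative:

```hcl
resource "grafana_dashboard" "orders_overview" {
  # The dashboard JSON lives in git next to this file,
  # so every change goes through PR review.
  config_json = file("${path.module}/dashboards/orders-overview.json")
}
```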

Whichever tool: the principle is that the dashboard source of truth is in git, not in Grafana. The Grafana database is a read replica.


The dashboard review

Every quarter, a 30-minute review:

  1. Which dashboards actually got viewed? Grafana tracks this. Dashboards not viewed in 90 days are candidates for deletion.
  2. Are the top 3 dashboards still accurate? Did anything move, rename, or churn?
  3. Are there dashboards that keep being used but feel clunky? Fix them.
  4. Are there alerts pointing to dashboards that no longer exist? Fix those too.

Dashboards rot faster than code. Without a review cadence, you end up with 200 dashboards, 10 of which are useful, and no one knows which 10.


Quiz

KNOWLEDGE CHECK

You are on call. You get paged at 2am for a latency alert on the orders-service. You open the orders-service overview dashboard. What should the dashboard let you determine in the first 10 seconds?


What to take away

  • Three dashboards per service cover 95% of usage: overview, detail, internal.
  • The overview dashboard must fit on one screen and answer "is the service OK?" in 10 seconds.
  • Panels show rates, not raw counters. Show percentiles, not averages. Set units and thresholds explicitly.
  • Use recording rules for expensive queries so dashboards load fast.
  • Use template variables so one dashboard covers many services.
  • Deploy annotations are the cheapest piece of debugging context you can add.
  • Dashboards as code. Quarterly review. Delete unused ones.

Next lesson: alert design — writing alerts that page you for real problems and stay quiet the rest of the time.