Debugging with Observability
This is the capstone lesson. Everything we have built — metrics, logs, traces, SLOs, dashboards, alerts — comes together during an actual incident. The workflow is not theoretical: it is the concrete sequence of "click here, then here, then here" that gets you from "pager just fired" to "fix deployed."
A team that has nailed this workflow resolves an incident in 10-15 minutes. A team that hasn't spends 2-4 hours on the same incident. The gap is rarely raw engineering skill — it's the systems and the muscle memory of knowing which tool answers which question.
The debugging flow is metrics → traces → logs. In that order. Metrics tell you something is wrong. Traces tell you where the time went. Logs tell you what went wrong in that specific place. Going out of order wastes time.
The flow
A walkthrough — elevated p99 latency
Let's walk through a real-shape incident end to end.
14:42 — the page
Severity: page
Service: orders-service
Summary: Fast burn of availability SLO for orders-service
current error rate: 4.2% (threshold 1%)
Dashboard: https://grafana.internal/d/orders-overview
Runbook: https://runbooks.internal/orders/slo-burn
You wake up, open the dashboard link.
14:43 — metrics phase: scope the problem
On the overview dashboard you see:
- SLO compliance: dropped from 99.95% to 98.3%.
- Error rate: climbed from ~0.1% to 4.2% starting at 14:38.
- Request rate: unchanged (no traffic spike).
- p99 latency: climbed from 450ms to 2.1s.
- Pod count: normal.
- Deploy annotation: no deploy in the last 2 hours.
Two minutes to answer: something happened at 14:38, affecting error rate AND latency, not caused by a deploy or traffic spike.
This narrows the universe. We know:
- It is not a recent code change (no deploy).
- It is not load (no traffic spike).
- It is not a scaling event (pod count normal).
- It is user-visible (SLO burn + latency climb).
14:44 — metrics phase: localize
On the detail dashboard:
- Errors by endpoint: the spike is 100% on POST /orders/submit. Every other endpoint is healthy.
- Errors by status: all 502s. No 500s, no 503s.
- Latency by downstream dependency: latency to inventory-service jumped from 80ms to 1.9s.
Answer: POST /orders/submit is slow, returning 502s, and spending all the added time talking to inventory-service.
14:46 — trace phase: find the bottleneck
Open the tracing tool. Search:
service = orders-service
span.name = POST /orders/submit
duration > 1s
time: last 10 minutes
Pull up a typical slow trace. The span tree shows:
POST /orders/submit 2,050 ms
├── auth check 45 ms
├── cart validation 80 ms
├── inventory-service.reserve 1,800 ms ← here
│   ├── postgres.SELECT FROM inventory 12 ms
│   ├── redis.GET cart_reservations 8 ms
│   └── external_wms.reserve 1,770 ms ← and here
│       └── HTTP 502 after timeout
├── payments-service.authorize 45 ms
└── db.INSERT INTO orders 80 ms
Answer: inventory-service is failing its outbound call to external_wms. The WMS is timing out, returning 502s to inventory-service, which propagates 502s to orders-service.
Now we know which service to blame, which external dependency is the problem, and what the failure mode is.
14:47 — logs phase: confirm the root cause
In the log tool, filter:
service = inventory-service
trace_id = <trace_id from the slow trace>
You see:
{"ts":"14:38:02","level":"warn","msg":"external_wms call timed out","url":"wms.vendor.com/reserve","duration_ms":1800}
{"ts":"14:38:02","level":"error","msg":"inventory reservation failed","err":"dial tcp 203.0.113.42:443: i/o timeout"}
Search logs without the trace_id filter, just the service and time range:
service = inventory-service
message = "external_wms call timed out"
time: last 30 min
~1,200 matches, all since 14:38. Before 14:38, near zero.
Answer: the external WMS endpoint wms.vendor.com started timing out at 14:38. inventory-service cannot reach it. Nothing on our side changed.
Total elapsed time: ~5 minutes from page to root cause.
14:48 — respond
Runbook step for "external dep failing":
- Notify the vendor (they probably already know).
- Enable fallback path — the inventory-service has a "best-effort" flag that skips the external WMS and reserves optimistically.
- Feature-flag the checkout path to use that flag.
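The fallback in the second step can be sketched in a few lines. Everything here is illustrative — the function names, the flag name `inventory_best_effort`, and the return shape are hypothetical stand-ins, not the actual orders/inventory codebase:

```python
def reserve_optimistically(order):
    # Record the reservation locally; reconcile with the vendor WMS later.
    return {"order_id": order["id"], "status": "reserved_optimistic"}

def reserve_via_wms(order):
    # Stand-in for the real external call, which is currently timing out.
    raise TimeoutError("wms.vendor.com/reserve: i/o timeout")

def reserve_inventory(order, flags):
    # The "best-effort" flag routes around the failing dependency.
    if flags.get("inventory_best_effort"):
        return reserve_optimistically(order)
    return reserve_via_wms(order)

# With the flag on, checkout succeeds without touching the WMS.
print(reserve_inventory({"id": 42}, {"inventory_best_effort": True}))
```

The key property is that the flag is evaluated per request, so flipping it takes effect immediately without a deploy.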
A minute later, error rate drops back to 0.2%. p99 latency drops back to 500ms. SLO burn rate drops back below 1×.
14:50 — post-incident
Create a ticket: "evaluate retry/circuit-breaker for external_wms so we fall back automatically." That becomes Monday morning's first task.
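The shape of what that ticket asks for — fail over automatically instead of waiting for a human to flip a flag — is a circuit breaker. A minimal sketch (a real service would usually use a battle-tested library or a mesh-level policy rather than hand-rolling this):

```python
import time

class CircuitBreaker:
    """After `max_failures` consecutive failures, stop calling the
    dependency and use the fallback for `cooldown` seconds."""

    def __init__(self, max_failures=3, cooldown=30.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                return fallback()          # circuit open: fail over fast
            self.opened_at = None          # cooldown elapsed: probe again
        try:
            result = fn()
            self.failures = 0              # success resets the count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()

def flaky_wms():
    raise TimeoutError("wms timeout")      # simulated failing dependency

breaker = CircuitBreaker(max_failures=2)
results = [breaker.call(flaky_wms, lambda: "optimistic-reserve") for _ in range(3)]
print(results)  # ['optimistic-reserve', 'optimistic-reserve', 'optimistic-reserve']
```

After the second failure the breaker opens, so the third call never touches the dependency at all — exactly the behavior that would have made this a non-incident.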
The pattern is always the same
Strip the specifics out and the pattern is:
- Alert fires. The alert is already symptom-based (SLO burn), so you know it is real and user-visible.
- Metrics localize the scope. Which service, which endpoint, which status, when did it start.
- Traces localize within the service. Which span is taking the time, which downstream is failing.
- Logs give the root cause. The exact error, from the exact service, at the exact time.
- Runbook has the remediation. Or you make one up and write it for next time.
- Postmortem + ticket for the structural fix.
The flow works on every incident I have worked on. Latency. Errors. Data issues. Capacity problems. The tools change; the flow doesn't.
What goes wrong when the flow breaks
The flow requires all three pillars be instrumented and connected. Common failure modes:
Metrics exist but are not granular enough
"Error rate is up" but there is no breakdown by endpoint, status, or downstream. You cannot localize. Back to grepping.
Fix: Go back and add the RED metrics with appropriate labels (route, status, method, upstream).
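To make the point concrete, here is a toy in-memory version of a labeled request counter — not a real metrics client, just the idea of why labels matter (the routes and counts are made up):

```python
from collections import Counter

# Toy labeled RED counter: without (route, method, status) dimensions
# you see "errors are up" but cannot localize the spike.
requests = Counter()

def record(route, method, status):
    requests[(route, method, str(status))] += 1

for _ in range(42):
    record("/orders/submit", "POST", 502)
record("/orders/list", "GET", 200)

# During the incident you slice by label, not by total:
spike = {k: v for k, v in requests.items() if k[2] == "502"}
print(spike)  # {('/orders/submit', 'POST', '502'): 42}
```

With a real client (e.g. a Prometheus counter), the labels become queryable dimensions, which is what powers the "errors by endpoint" panel from the walkthrough.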
Traces exist but don't cover the failing code path
You pull up slow traces but they are missing the span for the external call. You cannot see where the time went.
Fix: Audit auto-instrumentation coverage. Add manual spans for the business operations that matter.
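A manual span is usually just a timed context around the business operation. This toy helper shows the idea; in real code you would use the OpenTelemetry SDK's tracer rather than this stand-in (the span name is taken from the walkthrough, the helper itself is hypothetical):

```python
import time
from contextlib import contextmanager

spans = []  # stand-in for the tracing backend

@contextmanager
def span(name):
    """Record a (name, duration) pair for the wrapped operation."""
    start = time.monotonic()
    try:
        yield
    finally:
        spans.append((name, time.monotonic() - start))

# Wrap the business operation that auto-instrumentation missed:
with span("external_wms.reserve"):
    time.sleep(0.01)  # stand-in for the external call

print(spans[0][0])  # external_wms.reserve
```

Once that span exists, the 1,770 ms gap in the trace tree stops being invisible.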
Logs exist but have no trace_id
You find the slow trace but cannot pivot to logs — too many log lines, no way to filter to just this request.
Fix: Get trace_id into every log line emitted during a request. It is the single biggest cross-pillar lever.
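In Python's stdlib logging, one way to do this is a `logging.Filter` that stamps the current trace_id onto every record. This is a sketch: the trace_id value is a made-up example, and a real setup would pull it from the active trace context per request rather than fixing it at handler setup:

```python
import logging
import sys

class TraceIdFilter(logging.Filter):
    """Attach the current request's trace_id to every log record."""
    def __init__(self, trace_id):
        super().__init__()
        self.trace_id = trace_id
    def filter(self, record):
        record.trace_id = self.trace_id
        return True

logger = logging.getLogger("inventory-service")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter(
    '{"level":"%(levelname)s","trace_id":"%(trace_id)s","msg":"%(message)s"}'
))
logger.addHandler(handler)
logger.addFilter(TraceIdFilter("4bf92f3577b34da6"))  # hypothetical id

logger.warning("external_wms call timed out")
# {"level":"WARNING","trace_id":"4bf92f3577b34da6","msg":"external_wms call timed out"}
```

Every line this logger emits now carries the trace_id, which is what makes the trace-to-logs pivot in the walkthrough a single filter instead of a grep.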
Everything exists but lives in three different tools
You have metrics in Prometheus, traces in Jaeger, logs in Elasticsearch. Each tool has its own time range picker, its own search syntax, its own UI. The pivot takes minutes each.
Fix: Use a tool suite that links them (Grafana + Loki + Tempo + Prometheus; or Datadog APM + Logs + Metrics; or Honeycomb for traces + metrics). The friction reduction is enormous.
One unified pivot UI across the three pillars is worth more than any individual tool's features. A 30-second pivot from trace to logs, compared to a 3-minute one, is the difference between a 10-minute incident and a 30-minute one.
Common debugging scenarios
Scenario 1: "p99 is up, p50 is fine"
This is a tail latency problem. Something is slow for some requests but not most. Candidates:
- Cache miss path being exercised more than usual.
- Per-user issue (specific customer's data is slow).
- Per-tenant issue (DB query slow on a large tenant's table).
- GC pause on a specific pod.
- A specific downstream is slow some of the time.
Tool: traces filtered to duration > p99 threshold. Compare slow traces to fast traces. The difference is usually obvious.
Scenario 2: "error rate is up, latency is fine"
Something is failing fast. Usually the service is saying "no" before doing the work.
- Auth is rejecting requests.
- A circuit breaker is open.
- A feature flag evaluated differently.
- A validation is failing on new input.
Tool: logs filtered by level=error, grouped by error type. Pattern emerges.
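"Grouped by error type" is often just a frequency count over the error field of your structured logs. A sketch with made-up sample lines:

```python
import json
from collections import Counter

# Sample error-level log lines (invented for illustration).
lines = [
    '{"level":"error","err":"auth: token expired"}',
    '{"level":"error","err":"auth: token expired"}',
    '{"level":"error","err":"validation: missing field sku"}',
]

# Group by error message; the dominant failure mode floats to the top.
by_type = Counter(json.loads(line)["err"] for line in lines)
print(by_type.most_common(1))  # [('auth: token expired', 2)]
```

Log tools do this for you (group-by or "top values" on a field); the point is that structured logs make the grouping a query instead of a regex.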
Scenario 3: "latency is up, error rate is fine"
Requests are completing but slowly. Everything is succeeding, just taking longer. Usually a resource contention issue.
- DB is slow (lock contention, slow query plan, replica lag).
- Downstream is slow.
- GC pauses.
- Threadpool saturation.
Tool: traces. Look at slow traces; find which span grew.
Scenario 4: "something was fine, then it wasn't"
The classic "what changed?" question.
Tool: deploy annotations on dashboards. If you see a vertical line at the time of the change, the deploy is the suspect. If not, something external changed (dependency, traffic pattern, data).
The "N+1 query" debugging special case
One of the most common production-latency bugs: N+1 queries. An endpoint accidentally makes one DB query per result item instead of a single batched query.
Metrics often miss this — latency might be fine on small result sets and terrible on large ones. Traces are how you spot it:
GET /users/42/orders 3,400 ms
├── db.SELECT FROM users 8 ms
├── db.SELECT FROM orders WHERE user_id=42 12 ms (returns 300 rows)
├── db.SELECT FROM items WHERE order_id=1 8 ms
├── db.SELECT FROM items WHERE order_id=2 8 ms
├── db.SELECT FROM items WHERE order_id=3 8 ms
│   ... one query per remaining order_id ...
└── db.SELECT FROM items WHERE order_id=300 8 ms
300 tiny sequential queries. 2.4 seconds of latency from ORM laziness. Traces make this one click away; metrics alone would never surface it.
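The fix is the batched shape: one `IN (...)` query instead of one query per order. A minimal, self-contained sketch using an in-memory SQLite table (the schema and data are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (order_id INTEGER, sku TEXT);
    INSERT INTO items VALUES (1, 'a'), (1, 'b'), (2, 'c');
""")
order_ids = [1, 2]

# N+1 shape: one round trip per order_id (what the trace showed).
n_plus_1 = [
    conn.execute("SELECT sku FROM items WHERE order_id = ?", (oid,)).fetchall()
    for oid in order_ids
]

# Batched shape: a single round trip for all order_ids.
placeholders = ",".join("?" * len(order_ids))
batched = conn.execute(
    f"SELECT order_id, sku FROM items WHERE order_id IN ({placeholders})"
    " ORDER BY order_id, sku",
    order_ids,
).fetchall()
print(batched)  # [(1, 'a'), (1, 'b'), (2, 'c')]
```

Most ORMs have a built-in way to request the batched shape (eager loading / prefetching); the trace is how you notice you forgot to ask for it.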
The "correlated services" special case
Sometimes multiple services all spike at once. Which is the cause?
Your first suspect is a shared downstream — all the services are downstream of the same database, cache, or external API. If that shared dependency is slow, everyone upstream sees the same symptom.
Tool: traces. Find a slow trace in each service; compare the span trees. The common span near the bottom of the tree is the shared dependency.
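"Compare the span trees" can be as mechanical as intersecting the span names of one slow trace per service. A toy version with invented service and span names:

```python
# One slow trace per affected service, reduced to its set of span names.
traces = {
    "orders-service":  {"auth", "cart", "postgres.orders", "redis.shared"},
    "search-service":  {"index.query", "redis.shared"},
    "profile-service": {"postgres.users", "redis.shared"},
}

# The span every slow trace has in common is the shared dependency.
shared = set.intersection(*traces.values())
print(shared)  # {'redis.shared'}
```

In practice you eyeball this in the trace UI rather than scripting it, but the logic is the same: the common span near the bottom of every tree is your suspect.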
War story — the cache stampede
Real incident: an e-commerce service started showing a 4× latency spike every morning at 8:00 sharp. Perfectly periodic, perfectly mysterious. Dashboards showed the spike but not the cause.
Traces told the story within two minutes:
- At 8:00:00, cache keys were expiring in large numbers.
- Every expired key triggered a fresh database query.
- The database was seeing a 20× spike in QPS on the regenerate path.
- The service latency reflected the DB saturation.
Root cause: cache TTLs were all being set at the same time (midnight UTC, 8-hour TTL → all expire at 8:00 UTC). A classic cache stampede.
Fix: jitter the TTL by ±10%. Problem gone.
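The jitter itself is one line: scale the base TTL by a random factor so keys written together stop expiring together. Sketch (the 8-hour base matches the incident; the helper name is ours):

```python
import random

BASE_TTL = 8 * 60 * 60  # 8 hours, as in the incident

def jittered_ttl(base=BASE_TTL, jitter=0.10):
    """Spread expiries over ±10% so co-written keys don't co-expire."""
    return int(base * random.uniform(1 - jitter, 1 + jitter))

ttls = [jittered_ttl() for _ in range(5)]
assert all(0.9 * BASE_TTL <= t <= 1.1 * BASE_TTL for t in ttls)
```

±10% of 8 hours spreads the expiries across a ~96-minute window, which turns the 8:00 stampede into a gentle trickle of regenerations.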
Without traces, we would have stared at "QPS spike at 8am" for days. With traces it took 120 seconds.
The 10-minute incident vs 4-hour incident gap is not skill. It is instrumentation + workflow + muscle memory. Drill the flow on small incidents so you have it automatic during big ones.
Quiz
You get paged for a latency SLO burn. You open the overview dashboard and see p99 has climbed from 400ms to 1.8s. Error rate is normal. No recent deploys. What is your next move?
What to take away
- The debugging flow is metrics → traces → logs — in that order. Each pillar narrows the search space before the next.
- Metrics answer scope and shape: what, where, when.
- Traces answer localization: which service, which span, which downstream.
- Logs answer root cause: the exact error or state that caused the issue.
- A broken flow (no trace_id in logs, traces not covering the relevant code path, metrics without the right labels) means back to grep. Fix the broken piece.
- Deploy annotations on dashboards answer the first question ("did we just change something?") for free.
- Drill the flow on small incidents. Speed under pressure is muscle memory, not genius.
Congratulations
You have finished the Observability Fundamentals course. You now have the working vocabulary and practical patterns for the third pillar of production engineering — the one that most engineers learn by accident, too slowly, and too painfully.
The tools and vendors will keep changing. The fundamentals in this course — the shape of a metric, the structure of a log, the purpose of a trace, the error budget as a decision tool, the metrics-traces-logs flow — will keep working for the next decade.
Build these into your day-to-day. The next time a production incident lands, you will find yourself walking the flow automatically. And the 10-minute resolution will feel normal.
Thanks for taking the course. If it was useful, the best thing you can do is tell another engineer about it.