Observability Fundamentals for Engineers

Monitoring vs Observability

An engineer is 45 minutes into an incident. The dashboard shows green: CPU is fine, memory is fine, the request rate looks normal. But users are complaining about slow page loads. The engineer keeps adding panels — disk I/O, network throughput, GC time. Still green. An hour later, someone notices p99 latency for a specific API endpoint has tripled — but only for requests from one region, only for a particular tenant, only for a specific endpoint variant. The dashboard does not show this because nobody pre-aggregated data along those dimensions. The data exists; the questions needed to find the problem were not asked in advance. This is the gap between monitoring (answering pre-defined questions) and observability (asking new questions).

The monitoring-vs-observability distinction is not marketing. It is the practical difference between "I can answer the questions I thought to ask" and "I can investigate problems I did not anticipate." For distributed systems — where the combinations of services, regions, versions, and user segments multiply faster than anyone can dashboard — the difference determines how fast you resolve incidents. This lesson sets up the mental model for the rest of the course.


The Old World: Monitoring

Monitoring answers a pre-defined question: "is the system up?" You decide in advance what metrics to collect, what thresholds matter, and what alerts to fire. The tools (Nagios, Zabbix, classic CloudWatch) are great at this.

  • Check: is the HTTP endpoint returning 200? (alert if not)
  • Check: is the CPU above 80%? (alert if yes)
  • Check: is the disk > 90% full? (alert if yes)

Everything you can monitor is a question you thought to ask ahead of time. For single-node systems or simple stacks, this is enough — you know the failure modes.
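The checks above can be sketched as plain predicates; every one is a question someone wrote down in advance (function names and thresholds here are illustrative, not a real Nagios configuration):

```python
# Classic threshold monitoring: each check answers one pre-defined
# question. Names and thresholds are illustrative.

def check_http(status_code: int) -> bool:
    """Healthy only if the endpoint returns 200."""
    return status_code == 200

def check_cpu(cpu_percent: float, threshold: float = 80.0) -> bool:
    """Healthy only while CPU stays under a fixed threshold."""
    return cpu_percent <= threshold

def check_disk(used_percent: float, threshold: float = 90.0) -> bool:
    """Healthy only while disk usage stays under a fixed threshold."""
    return used_percent <= threshold

def run_checks(status: int, cpu: float, disk: float) -> list[str]:
    """Return the alerts to fire; an empty list means 'all green'."""
    alerts = []
    if not check_http(status):
        alerts.append("http endpoint unhealthy")
    if not check_cpu(cpu):
        alerts.append("cpu above threshold")
    if not check_disk(disk):
        alerts.append("disk above threshold")
    return alerts
```

Anything outside these three questions (a per-tenant latency regression, say) stays invisible to this setup no matter how often the checks run.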

The New World: Observability

Observability is the property of a system that lets you ask new questions about its behavior without deploying new code. It is measured by whether you can answer questions like:

  • "Why is p99 latency 3x higher for this customer, in this region, on this endpoint, only during 2-4 AM?"
  • "Which downstream service's slow response is causing the cascading timeout?"
  • "Why are retries spiking — is it the client, the network, or a specific backend pod?"

You did not know to ask those specific questions when you designed the metrics. Observability means your telemetry (the combination of metrics, logs, and traces) captures enough signal that you can query into new questions during an investigation.
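As a toy illustration: if telemetry is stored as wide, structured events, a question nobody anticipated becomes a filter over fields. The event schema below is invented for the example.

```python
# Toy event store: wide, structured events, one row per request.
# The schema is invented for this example.
events = [
    {"endpoint": "/search", "region": "eu-west", "tenant": "acme",
     "hour": 3, "latency_ms": 2100},
    {"endpoint": "/search", "region": "us-east", "tenant": "acme",
     "hour": 3, "latency_ms": 140},
    {"endpoint": "/cart", "region": "eu-west", "tenant": "acme",
     "hour": 14, "latency_ms": 95},
]

def query(rows, **conditions):
    """Filter on any combination of fields, including combinations
    nobody thought to pre-aggregate into a dashboard."""
    return [r for r in rows
            if all(r.get(k) == v for k, v in conditions.items())]

# The "why is this tenant slow, in this region, at this hour?"
# question, posed after the fact:
slow = query(events, tenant="acme", region="eu-west", hour=3)
```

The point is not the implementation; it is that the filter conditions were chosen during the investigation, not during design.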

KEY CONCEPT

The one-line difference: monitoring tells you whether the system is doing what you expected; observability helps you understand what it is actually doing when your expectations break down. Monitoring is about pre-defined checks. Observability is about the ability to pose and answer new questions in real time.


Known Unknowns vs Unknown Unknowns

Donald Rumsfeld's framework applies surprisingly well:

  • Known knowns — things you know and measure: request rate, error rate, p95 latency. Dashboards for these.
  • Known unknowns — things you know you do not know and can investigate: "is today slower than last week?" Ad-hoc queries you already know how to run.
  • Unknown unknowns — problems you did not anticipate. Observability is what makes these tractable.

Traditional monitoring covers known knowns very well. It cannot help with unknown unknowns — you cannot alert on a condition you did not anticipate. Observability's job is to let you investigate novel problems by querying raw telemetry.


Why Distributed Systems Forced the Shift

A 2005-era monolith had ~5 failure modes and you could name them all. A modern architecture has:

  • Dozens of microservices (sometimes hundreds at scale).
  • Multiple regions and availability zones.
  • Several deployment versions simultaneously (canary, stable, old).
  • Per-customer, per-plan, per-feature-flag behaviors.
  • Multi-tenant concurrency interactions.

The cross-product of these dimensions is enormous. Any specific scenario like "the login page is slow for enterprise customers in EU-west on v1.2.3 using feature flag X" is one combination out of millions, and you cannot pre-build an alert for each. You can only investigate it when it happens — if your telemetry supports the query.
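A back-of-envelope calculation shows why; the dimension sizes are made up, but conservative for a mid-size system:

```python
# Why you cannot pre-build a dashboard per scenario: the dimensions
# multiply. Sizes here are illustrative and conservative.
services = 50
regions = 4
versions = 3               # canary, stable, previous
customer_plans = 5
flag_combinations = 2 ** 6  # just six boolean feature flags

scenarios = services * regions * versions * customer_plans * flag_combinations
print(scenarios)  # 192000 distinct scenarios from five dimensions
```

Add per-customer and per-endpoint dimensions and the count runs into the millions.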


The Three Pillars (and Why They Are Not Enough)

The canonical answer to what observability is:

  Pillar    What it is                                            Best for
  Metrics   Numeric time-series (counters, gauges, histograms)    "Is something wrong?" and aggregate trends
  Logs      Records of discrete events with timestamps            "What exactly happened at this time?"
  Traces    End-to-end record of a request across services        "Where did this request spend its time?"

Each pillar answers different questions:

  • Metrics are cheap, aggregate, and fast to query. "Error rate is up 10x" is a metric.
  • Logs are expensive at scale but carry rich per-event detail. "User 12345 got this exact stack trace" is a log.
  • Traces follow one request through many services. "This request spent 3 seconds in the auth service" is a trace.
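One way to internalize the difference is to look at the three record shapes side by side. The field names below are illustrative, not a real wire format:

```python
# The three pillars as data shapes. Field names are illustrative.

metric = {  # cheap aggregate: one number per (name, labels, timestamp)
    "name": "http_requests_total",
    "labels": {"service": "checkout", "status": "500"},
    "ts": 1700000000,
    "value": 1042,
}

log = {  # rich detail about one discrete event
    "ts": 1700000000.123,
    "level": "ERROR",
    "user_id": "12345",
    "msg": "payment declined",
    "trace_id": "abc123",
}

span = {  # one hop of one request's end-to-end path
    "trace_id": "abc123",
    "span_id": "def456",
    "parent_id": None,
    "service": "checkout",
    "operation": "POST /checkout",
    "duration_ms": 3000,
}

# The shared trace_id is what lets you pivot from a log line
# to the full request path.
assert log["trace_id"] == span["trace_id"]
```

The `trace_id` field shared between the log and the span is the correlation key the next section keeps coming back to.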

Modern practice also recognizes additional pillars: events (like logs but more structured and queryable), profiles (continuous CPU/memory profiling), and real-user monitoring (RUM). The pillars are a loose taxonomy, not a strict schema.

PRO TIP

Three pillars is a useful mnemonic but deceptive — it suggests three separate tools. Real observability requires correlation between them: find the suspect time window in metrics, drill to logs from that window, follow the trace IDs in those logs to the full request path. Treating pillars as isolated tools produces what engineers call observability silos — you can query each but cannot connect them.


The Observability Stack

[Diagram: the observability stack]

Your services emit metrics, logs, traces, and events into three stores:

  • Metrics — Prometheus / VictoriaMetrics — numeric time-series — "is something wrong?"
  • Logs — Loki / Elasticsearch / Datadog — discrete events with context — "what exactly happened?"
  • Traces — Tempo / Jaeger / Honeycomb — request paths across services — "where did time go?"

On top sit Grafana / Kibana / Datadog dashboards for correlation, drill-down, and incident investigation: alerts plus ad-hoc exploration.

An Incident Walkthrough: Why You Need All Three

Consider a real scenario: p99 checkout latency doubled at 14:30.

Step 1: metrics. Query histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service="checkout"}[5m])). Confirm the spike. Break it down by pod, by region, by endpoint. Discover the spike is only on one pod in us-east.

Step 2: logs. Filter logs to that specific pod around 14:30. See a flood of retrying-database-connection messages. Note the correlation IDs.

Step 3: traces. Look up traces for those correlation IDs. Follow the spans. See that the checkout service spent 2.8s in the inventory service, which spent 2.5s waiting for a lock on a specific table.

Root cause: a long-running batch job locked the inventory table at 14:30, cascading into slow checkout requests on that region's pods.
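The three pivots above can be sketched as queries over in-memory telemetry; all records and field names are invented for the example:

```python
# Toy telemetry for the walkthrough; records and fields are invented.
logs = [
    {"pod": "checkout-7f", "ts": "14:31", "trace_id": "t1",
     "msg": "retrying database connection"},
    {"pod": "checkout-7f", "ts": "14:32", "trace_id": "t2",
     "msg": "retrying database connection"},
    {"pod": "checkout-2a", "ts": "14:31", "trace_id": "t3",
     "msg": "request ok"},
]
spans = [
    {"trace_id": "t1", "service": "checkout", "duration_s": 3.0},
    {"trace_id": "t1", "service": "inventory", "duration_s": 2.8},
    {"trace_id": "t1", "service": "inventory-db-lock", "duration_s": 2.5},
]

# Step 2: logs from the suspect pod in the suspect window.
suspect = [l for l in logs if l["pod"] == "checkout-7f"]
trace_ids = {l["trace_id"] for l in suspect}

# Step 3: follow those trace IDs into span data, slowest hop first.
hops = sorted((s for s in spans if s["trace_id"] in trace_ids),
              key=lambda s: -s["duration_s"])
```

In a real stack the same joins run across Loki and Tempo (or a SaaS backend), but the mechanics are the same: filter, collect IDs, follow them.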

Without all three pillars:

  • Metrics alone: you see the latency spike but cannot isolate which dependency.
  • Logs alone: a firehose of text; you cannot see the latency pattern.
  • Traces alone: per-request detail, but no sense of the scale (is this one request or ten thousand?).

Combined: the diagnosis took 15 minutes instead of 3 hours.

WAR STORY

A team built a perfect Grafana dashboard — 30 panels covering every service's golden signals. During an incident, the dashboard proudly reported everything green. The incident was a slow-leak bug where one specific user's requests were failing, which did not move aggregate metrics. Only when someone queried logs with that user's ID did the issue surface. Lesson: metrics are for patterns; logs and traces are for specific cases. A dashboard that only shows aggregates misses long-tail issues — which are disproportionately where real customer complaints originate.


Mental Model Shifts

A few practical shifts moving from monitoring to observability:

From dashboards-for-everything to dashboards-for-the-top-5-questions

Pre-observability teams build dashboards for every metric just in case. Modern teams build a small number of high-signal dashboards focused on SLO-aligned metrics, and rely on ad-hoc queries for the long tail.

From alerts-on-every-metric to alerts-on-symptoms

Alert on user-facing impact (error rate high, latency elevated, SLO burning fast). Not on every CPU spike, which is usually not a problem on its own. We cover this in Module 6.
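A minimal sketch of a symptom-based alert, assuming a hypothetical 0.1% error budget (the threshold and function names are illustrative):

```python
# Symptom-based alerting: page on user impact, not on machine causes.
# The 0.1% error budget is a hypothetical SLO threshold.

def error_rate(total_requests: int, failed_requests: int) -> float:
    return failed_requests / total_requests if total_requests else 0.0

def should_page(total: int, failed: int, budget: float = 0.001) -> bool:
    """Page a human only when users are actually affected."""
    return error_rate(total, failed) > budget

# A CPU spike with few failed requests does not page:
assert should_page(total=100_000, failed=20) is False   # 0.02% < 0.1%
assert should_page(total=100_000, failed=500) is True   # 0.5%  > 0.1%
```

Note that CPU never appears in the decision: it may explain the failures later, but it is not what you page on.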

From fixed labels to high-cardinality investigation

Prometheus's design imposes label-cardinality limits. For deep observability of individual users or requests, newer tools (Honeycomb, Elasticsearch, OpenSearch) handle high cardinality natively. We cover cardinality in Lesson 1.3.
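The arithmetic behind those limits is simple: each unique label combination is a separate time series the backend must store. A sketch (dimension sizes invented):

```python
# Each unique label combination becomes its own time series.
# Dimension sizes below are illustrative.

def series_count(label_value_counts: dict[str, int]) -> int:
    """Number of series a metric produces for given label cardinalities."""
    total = 1
    for n in label_value_counts.values():
        total *= n
    return total

safe = series_count({"service": 50, "region": 4, "status": 5})
risky = series_count({"service": 50, "region": 4, "user_id": 100_000})
# safe  -> 1,000 series: fine for a Prometheus-style backend
# risky -> 20,000,000 series: a cardinality explosion
```

Swapping one bounded label (`status`) for one unbounded one (`user_id`) multiplies storage by four orders of magnitude, which is why per-user detail belongs in events or traces, not metric labels.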

From snowflake tools per team to unified platform

Metrics in Prometheus, logs in Elasticsearch, traces in Jaeger, events in Splunk — each with its own UI. You lose correlation. OpenTelemetry and unified backends (Datadog, Grafana Cloud, Honeycomb) stitch them together.


Modern Terminology

  • Observability (o11y). Coined in control theory (Rudolf Kálmán, 1960s); adopted in software by Twitter's engineering in the early 2010s and popularized around 2016 by Honeycomb and distributed-systems teams.
  • Telemetry. The data: metrics, logs, traces, profiles. "Our telemetry pipeline ingests 5 TB/day."
  • SLO (Service Level Objective). A target reliability number like 99.9% availability. Module 5.
  • SRE (Site Reliability Engineering). Google's practice of applying software-engineering discipline to operations. Observability is one SRE pillar.
  • Cardinality. Number of unique label combinations. Lesson 1.3.
  • Pillar. Metric, log, or trace. Sometimes extended to events, profiles.
  • Three pillars and its criticisms. Some argue the pillars model encourages silos; modern thinking (Charity Majors' observability 2.0) emphasizes wide events that can be queried along any axis.

Where This Course Focuses

The rest of the modules go deep on the core tools and practices:

  • Module 2: Metrics + Prometheus + PromQL (the most widely deployed foundation).
  • Module 3: Logs (structured, correlated, cost-aware).
  • Module 4: Distributed tracing + OpenTelemetry (vendor-neutral standard).
  • Module 5: SLIs/SLOs/error budgets (the reliability framework).
  • Module 6: Dashboards, alerting, and incident response — tying it all together.

Each is a standalone topic, but they form one workflow: you instrument services with OpenTelemetry, ship telemetry to your backend (Prometheus + Loki + Tempo, or a SaaS), define SLOs, alert on SLO burns, and during incidents pivot between pillars to find root cause.
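As a toy illustration of the glue in that workflow, here is the context propagation that OpenTelemetry automates. This is a simplified stand-in, not the OTel SDK, and the real W3C traceparent header carries more structure than a bare ID:

```python
# Toy version of trace-context propagation. OpenTelemetry does this
# (and much more) automatically; the real W3C traceparent header also
# carries a span ID and flags, not just a trace ID.
import uuid

def handle_request(headers: dict) -> dict:
    """Continue the caller's trace, or start a new one at the edge."""
    trace_id = headers.get("traceparent") or uuid.uuid4().hex
    # Every log line and every downstream call made while handling
    # this request should carry trace_id, so that metrics, logs, and
    # traces for the same request can be joined later.
    downstream_headers = {"traceparent": trace_id}
    return downstream_headers
```

Without this propagation, each service's telemetry is an island and the pillar-to-pillar pivots described earlier stop working.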


Key Concepts Summary

  • Monitoring answers pre-defined questions; observability helps you ask new ones during investigation.
  • The three pillars: metrics (aggregates), logs (events), traces (request paths). They correlate best when tied together with shared IDs.
  • Distributed systems forced the shift. Pre-defined dashboards cannot cover the combinatorial space of services x regions x versions x customers.
  • Alerts should target symptoms (user-facing issues), not causes (every CPU spike).
  • Telemetry pipelines produce huge data. Cost-awareness (cardinality budgets, log sampling) is an operational concern from day one.
  • OpenTelemetry has become the standard instrumentation layer — vendor-neutral metrics + logs + traces from one SDK.
  • SLOs (Module 5) are the contract that makes reliability measurable rather than vibes-based.
  • The three pillars are not literally three tools. Correlation across them is where observability wins over classical monitoring.

Common Mistakes

  • Building 50-panel dashboards for everything. Humans can focus on 5-8 panels at once; the rest is noise.
  • Alerting on causes (CPU, memory) not symptoms (error rate, latency). Produces alert fatigue.
  • Treating metrics, logs, and traces as separate, non-correlated tools. Without trace IDs in logs, you cannot pivot.
  • Logging everything at INFO without structure. Volume grows, useful signal drops.
  • Assuming high-cardinality queries work on Prometheus. They often do not; use event/tracing tools for per-user, per-request detail.
  • Not tying telemetry to SLOs. Nice charts with no clear is-this-bad signal.
  • Buying an observability SaaS before instrumenting with OpenTelemetry. Vendor lock-in; rewrite work when you switch.
  • Running separate tools per team (dev uses Datadog, ops uses Prometheus, security uses Splunk). Correlation dies at boundaries.
  • Over-alerting in the first year; under-tuning afterwards. Alerts should be reviewed and pruned every quarter.
  • Treating observability as a tool purchase, not an engineering practice. Tools help, but the discipline (instrument, label well, define SLOs) is where the value comes from.
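To make two of those mistakes concrete (unstructured INFO logging versus a structured, correlatable event), here is a hypothetical before-and-after; all field values are illustrative:

```python
import json

# Before: a string you can only grep, logged at the wrong level.
unstructured = "INFO user 12345 checkout failed after retry in eu-west"

# After: every field is queryable, and the trace_id links this line
# to the request's full path across services.
structured = json.dumps({
    "level": "ERROR",            # a failure is an ERROR, not INFO
    "event": "checkout_failed",
    "user_id": "12345",
    "region": "eu-west",
    "retries": 1,
    "trace_id": "abc123",        # illustrative ID
})
```

The structured form costs a few extra bytes per line and repays them the first time someone needs every `checkout_failed` event for one user in one region.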

KNOWLEDGE CHECK

Your team has a classic Nagios-style setup: HTTP healthchecks every 30s, CPU alerts, disk alerts. A user-visible bug causes checkout failures for about 2 percent of users on one endpoint. Nagios shows everything green. What is missing, and what kind of observability would surface this class of issue faster?