Observability Fundamentals for Engineers

What Tracing Is

Metrics tell you what is happening. Logs tell you what happened in each service. Tracing is how you stitch those logs together into the single story of one request flowing through a distributed system.

If you have ever stared at a dashboard that shows high p99 latency and had to grep across five services to figure out which one was slow — tracing is the pillar you were missing.

KEY CONCEPT

A trace is a tree of what one request did across every service it touched. It is the only tool that makes distributed systems debuggable in practice.


Why tracing exists

In a monolith, a slow request is a stack trace. You can profile it, you can add a timer around each function, you can figure out what is slow.

In a distributed system, a single user request might touch 20 services. The API gateway calls the auth service calls the user service calls postgres calls three different caches calls the recommendation service calls a machine learning model calls another API. Something is slow. Which of those 20 calls is the problem?

Metrics cannot answer it — they only tell you aggregate behavior across all requests. Logs cannot answer it without a trace_id that ties them all together, and even then you are manually reconstructing timing across services. Tracing was designed specifically for this question.


The core concepts — trace, span, context

Three terms you need to be fluent with:

One trace — many spans forming a tree:

api-gateway — POST /orders                  540ms
├─ auth-service — verify token               80ms
└─ orders-service — create order            440ms
   ├─ postgres INSERT                        25ms
   └─ inventory-service — reserve stock     380ms
      ├─ redis GET + SET                      8ms
      ├─ external-wms (slow)                340ms
      └─ kafka publish                       14ms

The root span owns the trace. Child spans run across services. Leaf spans are the actual work — DB calls, HTTP calls, queue operations.

Trace

A trace is the full end-to-end record of one request. It has a unique trace_id and contains every span generated while handling that request, in every service it touched.

Span

A span is one unit of work inside a trace. It has:

  • A name (typically the operation — POST /orders, SELECT from users).
  • A start time and an end time (so duration).
  • A span_id (unique per span).
  • A parent_span_id (unless it is the root span).
  • Attributes (key/value metadata — the user ID, the SQL query, the HTTP status).
  • Events (timestamped messages, like "cache miss" or "retry attempted").
  • Links to other spans, for async / fan-out cases.
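
The fields above can be boiled down to a small sketch. This is an illustrative model, not the OpenTelemetry SDK; note that duration is derived from the start and end timestamps rather than stored separately.

```python
# Illustrative span model (not the OpenTelemetry API): the fields from the
# list above, with duration derived from start and end timestamps.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    name: str
    span_id: str
    start_ms: float
    end_ms: float
    parent_span_id: Optional[str] = None            # None means root span
    attributes: dict = field(default_factory=dict)  # key/value metadata
    events: list = field(default_factory=list)      # (timestamp_ms, message)

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms

s = Span("POST /orders", "a1b2c3d4e5f6a7b8", 0.0, 540.0,
         attributes={"http.status_code": 200})
s.events.append((110.0, "cache_miss"))
print(s.duration_ms)  # 540.0
```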

Context

Trace context is what gets passed between services so each one knows "you are part of trace X, your parent span is Y." The standard is W3C Trace Context, which defines a traceparent HTTP header:

traceparent: 00-5e1f9c2b8a6d4e7f90a1b2c3d4e5f607-a1b2c3d4e5f6a7b8-01
             |  |                                |                |
             |  trace_id (32 hex)                span_id (16 hex) flags
             version

Every HTTP client propagates this header to the next service. Every server reads it and creates child spans under it.
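
To make the format concrete, here is a minimal hand-rolled parser. It is purely illustrative, since in practice OpenTelemetry's propagators read and write this header for you.

```python
# Minimal sketch: splitting a W3C traceparent header into its four fields.
# Real services let OpenTelemetry propagators handle this automatically.

def parse_traceparent(header: str) -> dict:
    version, trace_id, span_id, flags = header.split("-")
    # Per the W3C Trace Context spec: 32-hex trace-id, 16-hex parent-id.
    assert len(trace_id) == 32 and len(span_id) == 16
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_span_id": span_id,
        "sampled": int(flags, 16) & 0x01 == 1,  # low bit is the sampled flag
    }

ctx = parse_traceparent("00-5e1f9c2b8a6d4e7f90a1b2c3d4e5f607-a1b2c3d4e5f6a7b8-01")
print(ctx["trace_id"])  # 5e1f9c2b8a6d4e7f90a1b2c3d4e5f607
print(ctx["sampled"])   # True
```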

PRO TIP

You almost never construct traceparent values by hand. OpenTelemetry's instrumentation libraries do it automatically — your job is to make sure you are using them and that they are actually being called.


What tracing answers that metrics and logs cannot

Tracing answers three specific questions better than any other tool:

1. Where did the time go?

A dashboard says p99 is 800ms. Tracing tells you: 700ms of that 800ms was spent in inventory-service → external-wms. Now you know where to look.

2. What is the dependency graph in practice?

Your architecture diagram says service A calls B and C. Tracing shows that in production, A also calls D (via a deprecated code path no one documented), and every call to A fans out to B 3 times because of a retry loop.

3. What actually happened for this request?

A user reports they got a timeout at 14:23:01. You paste their request ID into the trace tool and see the exact call graph, down to the SQL query parameters. No reconstruction required.


When tracing is the right tool

Question                                                       Best tool
-------------------------------------------------------------  ---------
How many requests are failing?                                 Metrics
Is error rate up compared to last week?                        Metrics
What is p99 latency over time?                                 Metrics
Did a specific request succeed?                                Logs
What error did the recommendation service return at 14:23:01?  Logs
Why is p99 high?                                               Traces
Which downstream call is slow?                                 Traces
What is the actual call graph in production?                   Traces
Why did this request take 3 seconds?                           Traces

If the question is about individual requests and how they flowed through multiple services — it is a tracing question.


Anatomy of a span

A well-instrumented span carries enough attribute metadata to be useful without the logs:

{
  "trace_id": "5e1f9c2b8a6d4e7f90a1b2c3d4e5f607",
  "span_id": "a1b2c3d4e5f6a7b8",
  "parent_span_id": "0000000000000000",
  "name": "POST /orders",
  "service": "api-gateway",
  "start_time": "2026-04-19T14:23:01.200Z",
  "end_time": "2026-04-19T14:23:01.740Z",
  "duration_ms": 540,
  "attributes": {
    "http.method": "POST",
    "http.route": "/orders",
    "http.status_code": 200,
    "user.id": "42",
    "order.amount_cents": 4250,
    "net.peer.name": "orders-service"
  },
  "events": [
    { "time": "2026-04-19T14:23:01.310Z", "name": "cache_miss" },
    { "time": "2026-04-19T14:23:01.650Z", "name": "retry_attempted", "attempt": 2 }
  ],
  "status": { "code": "OK" }
}

The OpenTelemetry spec defines semantic conventions — standard attribute names for HTTP, DB, RPC, queue, FaaS, and other operations. Use them. http.status_code is the standard, not status or http_status.


The tracing data model — not a tree, a DAG

Most of the time a trace looks like a tree. But some workloads break that shape:

  • Fan-out / fan-in: one span starts many parallel spans, then waits for all of them. Parent-child links form a tree, but you also need sibling timing.
  • Queue / async: a span publishes a message, a different span consumes it later. The consumer is linked to the producer with a span_link, not a parent-child relationship.
  • Batch operations: one span processes a batch of 100 messages, each of which came from a different upstream trace. The span has span_links to 100 other traces.

KEY CONCEPT

OpenTelemetry models this as a DAG — spans have parent relationships and arbitrary span links. Most UIs render as a tree; the links are how you follow a message from producer to consumer.
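
As a sketch with plain dicts (not the OpenTelemetry API), a span link is just a pointer to a span in another trace, unlike a parent_span_id, which points within one trace.

```python
# Illustrative sketch with plain dicts (not the OpenTelemetry API): a span
# link points across traces; parent_span_id points within a single trace.

# The producer publishes a message as part of trace "A"...
producer = {"name": "kafka publish", "trace_id": "A",
            "span_id": "p1", "parent_span_id": "root-a", "links": []}

# ...and the consumer, running later in a new trace "B", links back to it.
consumer = {"name": "kafka consume", "trace_id": "B",
            "span_id": "c1", "parent_span_id": None,
            "links": [{"trace_id": "A", "span_id": "p1"}]}

# Following the link is how a UI jumps from the consumer back to the producer.
link = consumer["links"][0]
print(link["trace_id"], link["span_id"])  # A p1
```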


Sampling — you will not trace everything

At scale, you cannot afford to record every span of every trace. 10k RPS × 5 services × 5 spans each × 2KB per span = 500 MB/s of trace data. At per-GB pricing, that is an enormous bill from any vendor.

Every tracing pipeline samples. Two main strategies:

Head sampling

Decide at the start of the trace whether to record it. Simple, fast, but cannot be "smart" — the decision is made before you know if the trace is interesting.

sampling_rate: 0.01  # 1% of traces
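
One way a probabilistic head sampler can work, sketched below with an illustrative hashing scheme: derive the decision deterministically from the trace_id, so every service in the call chain makes the same keep/drop choice for a given trace.

```python
# Sketch of a head sampler (illustrative hashing, not a specific library):
# decide at trace start, deterministically from the trace_id, so every
# service in the chain makes the same keep/drop decision for one trace.

SAMPLE_RATE = 0.01  # keep 1% of traces

def head_sampled(trace_id: str, rate: float = SAMPLE_RATE) -> bool:
    # Treat the low 8 hex chars of the trace_id as a uniform number in [0, 1].
    bucket = int(trace_id[-8:], 16) / 0xFFFFFFFF
    return bucket < rate

print(head_sampled("5e1f9c2b8a6d4e7f90a1b2c3d4e5f607"))  # False (hashes high)
```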

Tail sampling

Collect all spans at the collector, then decide per-trace whether to ship them to storage. Expensive at the collector, but lets you keep 100% of interesting traces (errors, slow, rare paths) and drop uninteresting ones.

# Example tail-sampling policies (OpenTelemetry Collector
# tail_sampling processor syntax, simplified)
policies:
  - name: keep_errors
    type: status_code
    status_code:
      status_codes: [ERROR]
  - name: keep_slow
    type: latency
    latency:
      threshold_ms: 1000
  - name: probabilistic_rest
    type: probabilistic
    probabilistic:
      sampling_percentage: 1

Tail sampling gives you 100% of errors and slow traces, plus a 1% sample of successful fast traces. That is usually what you want.
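
The decision logic behind such a policy can be sketched in a few lines. This is illustrative only; the real work happens inside the collector's buffering and timeout machinery.

```python
# Illustrative tail-sampling decision over one buffered trace: keep all
# errors, keep all slow traces, probabilistically sample the boring rest.
import random

def keep_trace(spans, latency_threshold_ms=1000, sample_rate=0.01):
    if any(s.get("status") == "ERROR" for s in spans):
        return True                       # keep 100% of errors
    duration = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if duration >= latency_threshold_ms:
        return True                       # keep 100% of slow traces
    return random.random() < sample_rate  # 1% of successful fast traces

print(keep_trace([{"status": "OK", "start_ms": 0, "end_ms": 2000}]))  # True (slow)
```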


Tracing and the three-pillar debugging flow

This is the pattern most mature teams use during incidents. Walk top-down:

  1. Alert fires (metrics): p99 latency > 500ms for service X.
  2. Open the metric dashboard (metrics): which endpoints, which pods, when did it start?
  3. Open the tracing tool (traces): find traces in the slow bucket. Look at the span tree. Identify which downstream call is slow.
  4. Open the logs (logs): filter by trace_id on the slow downstream service. See the exact error or log lines for that request.

PRO TIP

The speed of this workflow is the payoff for investing in observability. A team with all three pillars properly integrated debugs an incident in 10 minutes. A team without debugs the same incident in 4 hours.
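
Step 4 works only because every log line carries the trace_id. With structured logs, pulling one request's logs is a simple filter rather than a cross-service grep. A sketch with made-up log lines:

```python
# Sketch (made-up log lines): with trace_id on every structured log line,
# pulling the logs for one request is a simple filter.
import json

logs = [
    '{"service": "orders-service", "trace_id": "5e1f", "msg": "order created"}',
    '{"service": "inventory-service", "trace_id": "5e1f", "msg": "wms timeout, retrying"}',
    '{"service": "orders-service", "trace_id": "9a2c", "msg": "order created"}',
]

def logs_for_trace(lines, trace_id):
    entries = [json.loads(line) for line in lines]
    return [e for e in entries if e["trace_id"] == trace_id]

for entry in logs_for_trace(logs, "5e1f"):
    print(entry["service"], entry["msg"])
```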


What tracing is NOT

  • Not a profiler. Tracing shows time in each service. Within a service, you still need CPU profiling, flame graphs, and conventional performance tooling for hot-loop issues.
  • Not a complete replacement for logs. Logs capture events; traces capture timing. You need both — and the trace_id on every log is what makes them complementary.
  • Not free. It costs engineer time to instrument, CPU time to collect, bandwidth to ship, and storage to keep. Sampling is not optional.
  • Not retrofittable into badly architected services. If your service does blocking I/O in unnamed goroutines, your spans will be missing half the work.

Quiz

KNOWLEDGE CHECK

A user reports their checkout is slow. You open your dashboards and see the overall p99 is normal — 200ms. But this user says their request took 8 seconds. What is the right tool to use next?


What to take away

  • A trace is the story of one request across every service it touched.
  • A span is one unit of work inside a trace — with start/end time, attributes, and a parent.
  • Context propagation (W3C traceparent header) is what links spans across services.
  • Use tracing when the question is about individual requests or cross-service timing — not aggregate rates.
  • Tail sampling (keep all errors + slow, sample the rest) is usually what you want at scale.
  • The debugging flow is: metrics find the problem → traces localize it → logs give the detail.

Next lesson: OpenTelemetry fundamentals — the standard instrumentation layer for making all of this work.