Tracing in Practice
Knowing what tracing is and knowing how to actually run it at scale are two different things. The first trace pipeline every team builds works great at 100 RPS. At 10,000 RPS you discover that you cannot afford to keep every span, you cannot run traces through a single collector, and you cannot find the trace you need when the backend has a billion of them.
This lesson is about the operational side: sampling strategies at scale, what to trace and what to skip, how to spend a reasonable tracing budget, and which backends are actually viable in production.
Tracing without sampling does not scale. The question is not whether to sample — it is whether to sample at the head, the tail, or both. The answer shapes your entire architecture.
The sampling decision — head vs tail
Head sampling
The decision happens at the start of the trace. The first service generates a random number, checks it against the sampling rate, and sets a flag in the trace context. Every downstream service honours the flag.
Pros:
- Cheap. You don't emit spans for unsampled traces at all.
- Predictable cost. Fixed percentage of traffic traced.
- Simple. No coordination between services needed.
Cons:
- Dumb. You decide before you know if the trace is interesting. A 0.1% sample rate means 99.9% of errors are invisible.
When to use it: early in the lifetime of a system, when you just need some traces, or when you are cost-constrained and have metrics to tell you when to look.
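The head-sampling decision can be sketched in a few lines of plain Go. This is a simplified illustration of the idea behind OTel's TraceIDRatioBased sampler, not the SDK's exact code: the decision is a deterministic function of the trace ID, so every service that sees the same ID reaches the same verdict without any coordination.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
	"math/rand"
)

// sampled makes the head-sampling decision deterministically from the
// trace ID: interpret the low 8 bytes as an integer and keep the trace
// if it falls below ratio * MaxUint64. Same ID, same answer, everywhere.
func sampled(traceID [16]byte, ratio float64) bool {
	x := binary.BigEndian.Uint64(traceID[8:])
	return x < uint64(ratio*math.MaxUint64)
}

func main() {
	r := rand.New(rand.NewSource(1)) // fixed seed: reproducible demo
	kept := 0
	for i := 0; i < 100000; i++ {
		var id [16]byte
		binary.BigEndian.PutUint64(id[0:], r.Uint64())
		binary.BigEndian.PutUint64(id[8:], r.Uint64())
		if sampled(id, 0.01) { // 1% head-sampling rate
			kept++
		}
	}
	fmt.Printf("kept %d of 100000 traces\n", kept)
}
```

Running this keeps roughly 1,000 of the 100,000 simulated traces — the fixed 1% of traffic that makes head sampling's cost predictable.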
Tail sampling
Every service emits every span. The collector buffers all spans of a trace until the trace finishes (or a timeout), then decides per-trace whether to keep it.
Pros:
- Smart. Decides after seeing the whole trace — keep errors, keep slow traces, sample successful fast traces.
- No interesting traces lost.
Cons:
- Expensive in collector resources. Every span has to be buffered long enough for the trace to complete.
- Requires a gateway layer — all spans of a given trace must land on the same collector instance.
When to use it: once you have enough traffic that head sampling loses interesting signal, which for most services happens around 1,000 RPS.
Tail sampling policies — what to actually keep
A realistic tail-sampling policy at production scale, expressed as a concrete collector config:
processors:
  tail_sampling:
    decision_wait: 10s   # how long to wait for the trace to complete
    num_traces: 100000   # in-memory buffer size
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow
        type: latency
        latency:
          threshold_ms: 500
      - name: canary
        type: string_attribute
        string_attribute:
          key: deployment.canary
          values: ["true"]
      - name: rare_routes
        type: rate_limiting
        rate_limiting:
          spans_per_second: 10
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 2
Build your tail-sampling policy incrementally. Start with "keep 100% of errors and 1% of everything else." Add policies (slow, canary, rare routes) only when you discover you are missing traces you want. Overly clever policies are hard to reason about.
How much does tracing cost?
A realistic per-span cost model at managed backends:
Typical span: 2 KB (with attributes, events, resource)
Typical service: 5-10 spans per request
Typical retention: 7 days
At 10,000 RPS, 100% sampling:
10,000 × 10 × 2 KB × 86,400s × 7 days = ~120 TB
At 1-2% sampling (99% tail-sample to baseline), that becomes ~1-2 TB — which most vendors charge roughly $1-5k/month for.
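Plugging the model's numbers into a few lines of Go makes the sampling lever concrete. The constants are the assumptions stated above (2 KB spans, 10 spans per request, 7-day retention), not vendor-measured figures:

```go
package main

import "fmt"

// retainedTB returns the retained trace volume in terabytes for a
// given sample rate, using the per-span cost model from the text.
func retainedTB(sampleRate float64) float64 {
	const (
		rps           = 10000.0 // requests per second
		spansPerReq   = 10.0    // spans per request
		bytesPerSpan  = 2000.0  // ~2 KB per span with attributes
		secondsPerDay = 86400.0
		retentionDays = 7.0
	)
	return rps * spansPerReq * bytesPerSpan *
		secondsPerDay * retentionDays * sampleRate / 1e12
}

func main() {
	fmt.Printf("100%% sampled: %.0f TB retained\n", retainedTB(1.0))
	fmt.Printf("  2%% sampled: %.1f TB retained\n", retainedTB(0.02))
}
```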
The metrics and logs lessons cover the broader pricing picture; the key insight here is that the sampling rate dominates the cost. Going from 100% to 1% is a 100× cost reduction, and tail sampling gives you nearly the same utility.
What to trace — the coverage question
You do not have to trace everything. Cover the paths that matter:
Do trace
- HTTP request handlers (in and out). This is table stakes.
- gRPC calls.
- Database queries.
- Cache operations (if the cache is in the critical path).
- Outbound API calls to third parties.
- Message publishes and consumes (with span links to connect them across the async boundary).
- Background jobs / cron runs.
- Business operations you care about — place_order, process_payment, run_ml_inference. These are the spans your dashboards will care about.
Do NOT trace
- Hot in-process functions. The overhead is noticeable and the value is low. Use CPU profiling.
- Health checks and readiness probes. Drop these at the collector.
- Prometheus /metrics scrapes. Same.
- Internal loops / batch items individually. Trace the batch, not each item.
- Your own logger internals. Do not trace log.info().
Overtracing is worse than undertracing. Hundreds of trivial spans in a trace make the span tree unreadable and expensive to store. Spans are for operations that cross a boundary (process, network, DB, queue) — not for every function call.
Finding traces when you need them
Having traces is only useful if you can find the one you need. Four common access patterns:
1. By trace_id
The most basic: paste a trace_id from a log or error report into the backend and see the trace. Every backend supports this.
2. By service + attribute
"Show me traces of orders-service where order.amount_cents > 100000." Requires that the relevant attribute is searchable. Most backends index a limited set of attributes — check which ones, or explicitly configure them.
3. By duration
"Show me traces of orders-service where duration > 1s." The single most useful filter when investigating latency issues.
4. By error
"Show me errored traces of payments-service." Some backends auto-identify errors via status.code = ERROR.
Add the trace_id to your error responses and log lines. When a user reports a problem, a trace_id in their error screenshot makes the investigation 10× faster.
The viable tracing backends
The useful tracing backends in 2026 (opinionated):
Self-hosted, OSS
- Grafana Tempo — object-storage-backed. Cheap. Scales to billions of spans. Queries via TraceQL. The default choice for Kubernetes-native stacks.
- Jaeger v2 — native OTel support, Elasticsearch or Cassandra backend. More operational overhead than Tempo.
- ClickHouse + custom schema — DIY. Cheap, fast, but you build the UI. Big companies only.
Managed
- Datadog APM — feature-rich, integrated with metrics/logs/RUM. Expensive. Excellent auto-instrumentation.
- Honeycomb — best-in-class for trace search and high-cardinality queries. Built around event-based model.
- New Relic, Dynatrace, Splunk Observability — enterprise. Strong in large orgs; price gets painful at volume.
What to pick
- Starting a new stack on Kubernetes? Grafana Tempo. Cheap, OTel-native, integrates with Grafana.
- Already on Datadog for metrics/logs? Use Datadog APM.
- High-cardinality debugging needs (ML, finance, complex business logic)? Honeycomb.
- Cannot / will not self-host, and cost is the dominant constraint? Honeycomb's free tier is generous; Datadog's is narrow.
Trace-to-logs and trace-to-metrics linking
A correctly set-up observability stack lets you click through from any point in a trace to the logs or metrics for that service at that moment:
- Trace → logs: logs filtered by trace_id (which is already there because your logger writes it).
- Trace → metrics: metrics for this service.name around this timestamp (which any Grafana dashboard can do).
Grafana (the tool) links these natively when Tempo + Loki + Prometheus are configured together. Datadog / New Relic do this via the integrated UI. You can build the links manually in any setup by including trace_id in logs and using the service name as the common key.
This cross-signal linking is what makes an observability stack feel cohesive. The 10-minute incident vs 4-hour incident gap is largely about whether these links are set up. Do them once; use them forever.
The async / messaging case
Traces across message queues look different. The publisher and the consumer are not in the same request — they are linked but not parent-child.
// Publisher
ctx, pubSpan := tracer.Start(ctx, "kafka.publish",
    trace.WithSpanKind(trace.SpanKindProducer),
)
defer pubSpan.End()

// Inject trace context into the Kafka message headers
msgHeaders := make(map[string]string)
otel.GetTextMapPropagator().Inject(ctx, propagation.MapCarrier(msgHeaders))
// ... send the message

// Consumer (runs later, different process, different request)
// Extract the publisher's context from the received message headers.
linkCtx := otel.GetTextMapPropagator().Extract(context.Background(), propagation.MapCarrier(msgHeaders))

// Start a NEW root span that links to the publish span. Starting the
// span directly on linkCtx would make it a child of the publish span,
// which misrepresents the async boundary.
ctx, consumeSpan := tracer.Start(context.Background(), "kafka.consume",
    trace.WithSpanKind(trace.SpanKindConsumer),
    trace.WithLinks(trace.LinkFromContext(linkCtx)),
)
defer consumeSpan.End()
Most backends can visualize this — the consume span shows up as a linked trace. Jaeger and Tempo both support it.
War story — the distributed timeout mystery
A team I worked with had a rare bug: one in every few thousand requests took exactly 30 seconds, then failed. Dashboards showed occasional p99 spikes but could not localize the cause.
We enabled tail sampling with a "slow trace" policy (duration > 5s). Within a day we had 50 slow traces captured. Every one of them showed the same thing:
api-gateway → orders-service → postgres (succeeded in 20ms)
api-gateway → recommendation-service → grpc call (hung 30s, timed out)
The recommendation service had a gRPC client with a 30-second default timeout and no client-side deadline. Under low probability the connection entered a zombie state that the client did not detect. The fix was a client-side deadline of 500ms.
Without tracing, that would have taken weeks to find. With the right sampling policy, it took one day.
Tail sampling pays for itself on the first incident where you find the slow call you could not have found any other way.
Quiz
Your service handles 5,000 RPS. You want to implement tracing. You are worried about cost. Which is the best strategy?
What to take away
- Sampling is mandatory at scale. Head sampling is cheap and dumb; tail sampling is expensive and smart.
- Tail sampling policies should keep 100% of errors, 100% of slow traces, and sample ~1-2% of baseline.
- Do not trace everything. Cover the paths that cross boundaries (HTTP, DB, queues, RPC) and the business operations you care about. Do not trace internal functions.
- Add trace_id to your error responses and log lines. It is how you find the right trace later.
- Cross-signal links (metrics ↔ traces ↔ logs) are what make observability feel cohesive. Do them once; benefit forever.
- Tempo is the pragmatic OSS default in 2026; Honeycomb is the best managed for high-cardinality debugging; Datadog for integrated vendor stacks.
Next module: SLOs and error budgets — the framework that turns observability data into engineering decisions.