OpenTelemetry Fundamentals
For a decade, every tracing vendor shipped its own SDK. Jaeger had one, Zipkin had another, Datadog had a third, New Relic had a fourth. Switching vendors meant reinstrumenting your entire codebase. Using two vendors meant running two SDKs side by side, both lying about clock skew.
OpenTelemetry (OTel) fixed this. It is a vendor-neutral standard for instrumentation. You write the instrumentation once against the OTel API. A collector translates that into whatever backend you are using — Jaeger, Tempo, Datadog, Honeycomb, Dynatrace, your own ClickHouse. Switch backends by changing a config file, not your code.
OpenTelemetry is now the default. Every major observability vendor supports it as a first-class input format. Every Kubernetes-native tracing stack (Tempo, Jaeger v2) speaks OTLP. New services should instrument with OTel, full stop.
What OpenTelemetry actually is
OTel is three things in a trench coat:
- APIs — the stable, vendor-neutral interfaces your code calls (tracers, meters, loggers).
- SDKs — the per-language in-process machinery behind the API: samplers, span processors, exporters.
- The Collector — a standalone process that receives telemetry, transforms it, and forwards it to backends.
The critical architectural decision OTel made: the API and the SDK are separate. Your application depends on the stable API. The SDK is configured at runtime. This means instrumentation code does not have to change when you change tracing backends.
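The API/SDK split can be illustrated with a hand-rolled sketch (this is NOT the real OTel API — the class and function names below are made up to show the pattern): application code calls a stable facade that defaults to a no-op, and a provider is swapped in at startup.

```python
# Sketch of the API/SDK split (hypothetical names, not the real OTel packages).
# Application code depends only on get_tracer(); the provider behind it is
# swappable at runtime, so instrumented code never changes with the backend.

class NoOpSpan:
    """Default span: does nothing, so instrumented code is safe without an SDK."""
    def set_attribute(self, key, value): pass
    def end(self): pass

class NoOpTracer:
    def start_span(self, name):
        return NoOpSpan()

_provider = NoOpTracer()  # no SDK configured: everything is a harmless no-op

def get_tracer():
    """The stable 'API' surface the application calls."""
    return _provider

def set_tracer_provider(tracer):
    """Called once at startup by whichever 'SDK' is configured."""
    global _provider
    _provider = tracer

# Application code looks the same whether or not a real SDK is wired in:
span = get_tracer().start_span("place_order")
span.set_attribute("order.id", "o-123")
span.end()
```

The real SDKs follow the same shape: without configuration, the OTel API is a no-op, which is why a library can safely instrument itself against the API alone.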
The three signals OTel covers
Originally OTel was just tracing (it was a merger of OpenTracing and OpenCensus). It now covers all three pillars:
- Traces — spans and trace context. The most mature signal.
- Metrics — a full metrics API (counters, gauges, histograms). Interoperable with Prometheus.
- Logs — structured log records with trace context. Newest; still stabilising.
The goal of OTel-logs is that every log automatically carries trace_id and span_id, so the correlation we built manually in the last module happens without your code knowing. In practice, most teams still use their language's native logger and layer OTel context in via a handler — which is fine.
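That "layer OTel context in via a handler" pattern can be sketched with the standard library alone (the real integration is the opentelemetry-instrumentation-logging package; the filter and context variable below are hand-rolled stand-ins):

```python
import contextvars
import logging

# Hand-rolled sketch of log/trace correlation. In a real setup the OTel SDK
# tracks the active span; here a context variable fakes that.
current_trace = contextvars.ContextVar("current_trace", default=("-", "-"))

class TraceContextFilter(logging.Filter):
    """Stamp every log record with the active trace_id / span_id."""
    def filter(self, record):
        record.trace_id, record.span_id = current_trace.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"))
handler.addFilter(TraceContextFilter())

log = logging.getLogger("orders")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Inside a span the SDK would set the context; here we set it by hand:
current_trace.set(("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"))
log.info("order placed")  # the emitted line now carries the trace context
```

Application code keeps calling the native logger; correlation comes for free from the filter.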
Auto-instrumentation vs manual instrumentation
Auto-instrumentation
For most frameworks, OTel provides plugins that instrument the framework automatically. Drop in a dependency, set an environment variable, and every HTTP handler / DB query / HTTP client call gets a span automatically.
Python — zero code change needed:
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install
# Run your app with the instrumenting wrapper
OTEL_SERVICE_NAME=my-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 \
opentelemetry-instrument python main.py
That one command instruments Flask, Django, FastAPI, requests, httpx, psycopg, SQLAlchemy, redis, boto3 — anything the OTel contrib packages recognize.
Java — a single JVM agent:
java -javaagent:/opt/otel-javaagent.jar \
-Dotel.service.name=my-service \
-Dotel.exporter.otlp.endpoint=http://otel-collector:4317 \
-jar my-app.jar
Node.js:
npm install --save @opentelemetry/auto-instrumentations-node
node --require @opentelemetry/auto-instrumentations-node/register app.js
Auto-instrumentation gives you 80% of the value on day one.
Manual instrumentation
For the 20% that auto-instrumentation misses — your own business logic, background jobs, custom protocols — you write spans manually:
Go:
import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)

var tracer = otel.Tracer("my-service")
func placeOrder(ctx context.Context, orderID string) error {
ctx, span := tracer.Start(ctx, "place_order",
trace.WithAttributes(
attribute.String("order.id", orderID),
),
)
defer span.End()
if err := validateOrder(ctx, orderID); err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, "validation failed")
return err
}
return saveOrder(ctx, orderID)
}
Python:
from opentelemetry import trace
tracer = trace.get_tracer("my-service")
def place_order(order_id: str):
with tracer.start_as_current_span("place_order") as span:
span.set_attribute("order.id", order_id)
try:
validate_order(order_id)
save_order(order_id)
except Exception as e:
span.record_exception(e)
span.set_status(trace.StatusCode.ERROR, str(e))
raise
Start with auto-instrumentation. Add manual spans only for the business operations you care about — "place_order", "process_payment", "run_ml_inference". Do not manually span every function; that level of detail is what auto-instrumentation and profilers already give you.
Semantic conventions — use the standard names
OTel defines standard attribute names for common operations. Using them means every backend and every dashboard understands your data without per-team translation.
| Domain | Standard attribute names |
|---|---|
| HTTP | http.request.method, http.route, http.response.status_code |
| DB | db.system, db.statement, db.operation.name |
| Messaging | messaging.system, messaging.destination.name, messaging.operation |
| Cloud | cloud.provider, cloud.region, cloud.account.id |
| Kubernetes | k8s.pod.name, k8s.namespace.name, k8s.node.name |
| Exception | exception.type, exception.message, exception.stacktrace |
// USE the standard names
// (semconv here refers to the versioned package, e.g.
//  go.opentelemetry.io/otel/semconv/v1.26.0 — pin whichever version you target)
span.SetAttributes(
semconv.HTTPRequestMethodKey.String("POST"),
semconv.HTTPRouteKey.String("/orders"),
semconv.HTTPResponseStatusCodeKey.Int(201),
)
// Not your own
span.SetAttributes(
attribute.String("method", "POST"), // non-standard
attribute.String("url", "/orders"), // non-standard
attribute.Int("status", 201), // non-standard
)
Dashboards, alerts, and even the OTel processors assume the standard names. Using your own names means you will not benefit from the ecosystem tooling.
The OTel Collector
The collector is the single most important piece of the OTel story. It decouples applications from backends.
What the collector does
- Receives — OTLP (and optionally Jaeger, Zipkin, Prometheus, etc.).
- Processes — batch, sample, filter, redact, add resource attributes (like k8s.pod.name), convert between signals.
- Exports — to one or more backends in their native format.
A minimal collector config:
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 1s
send_batch_size: 1024
resource:
attributes:
- key: deployment.environment
value: production
action: insert
exporters:
otlp/tempo:
endpoint: tempo:4317
tls:
insecure: true
otlp/datadog:
endpoint: datadog:4317
service:
pipelines:
traces:
receivers: [otlp]
processors: [batch, resource]
exporters: [otlp/tempo, otlp/datadog]
This config receives OTLP, batches spans, tags every span with deployment.environment = production, and ships to both Tempo and Datadog.
Where to run the collector
Three common topologies:
- Agent (DaemonSet): one collector per node. Apps send to localhost:4317. Low latency, no network hop. Good default.
- Gateway (Deployment): a central pool of collectors. Apps or agents send to it. Central point for applying tail sampling, auth, large transformations.
- Agent → Gateway: both. Agents batch and enrich, gateways do tail sampling and export. The standard for large deployments.
Tail sampling requires seeing all spans of a trace together. Agents cannot do this alone; you need a gateway layer where all spans for a given trace_id land on the same collector instance.
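Routing all spans of a trace to one gateway is typically done by hashing the trace_id at the load-balancing layer (roughly what the collector's loadbalancing exporter does). A minimal sketch of the idea, with made-up gateway names:

```python
import hashlib

# Sketch: pick a gateway instance deterministically from the trace_id, so
# every agent sends all spans of one trace to the same collector.
# (Gateway names are hypothetical; the real collector's loadbalancing
# exporter uses consistent hashing so scaling moves fewer traces.)
GATEWAYS = ["gateway-0", "gateway-1", "gateway-2"]

def gateway_for(trace_id: str) -> str:
    digest = hashlib.sha256(trace_id.encode()).digest()
    return GATEWAYS[int.from_bytes(digest[:8], "big") % len(GATEWAYS)]

# Same trace_id -> same gateway, from any agent, with no coordination:
print(gateway_for("4bf92f3577b34da6a3ce929d0e0e4736"))
```

Plain modulo hashing reshuffles most traces when the pool resizes, which is why the real exporter prefers consistent hashing.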
OTLP — the protocol
OTLP (OpenTelemetry Protocol) is how the SDK talks to the collector and how the collector talks to backends. Two transports:
- gRPC (port 4317) — default, efficient, preferred.
- HTTP/Protobuf (port 4318) — works through more firewalls, useful when gRPC is blocked.
The important thing: OTLP is the native format. Every OTel-compatible backend (Tempo, Jaeger v2, Datadog, Honeycomb, Dynatrace, Splunk, New Relic) ingests OTLP directly. You are not dependent on vendor adapters.
Resource attributes — who is sending this?
Every span / metric / log in OTel carries resource attributes — metadata about the thing producing the telemetry. These are set once per process (at SDK init) and attached to everything it sends.
service.name = orders-service # REQUIRED
service.version = v1.42.3
service.instance.id = orders-service-pod-abc123
deployment.environment = production
k8s.namespace.name = ecommerce
k8s.pod.name = orders-service-pod-abc123
k8s.node.name = ip-10-0-1-42
Most of these can be auto-detected by the SDK using OTEL_RESOURCE_ATTRIBUTES and Kubernetes resource detectors. The collector can also enrich them with the k8sattributes processor.
service.name is the single most important attribute. Every backend keys off it — dashboards, metrics aggregations, error grouping. Set it explicitly via OTEL_SERVICE_NAME. Never let it default to unknown_service.
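The OTEL_RESOURCE_ATTRIBUTES value is a comma-separated list of key=value pairs. A simplified sketch of how an SDK turns the environment into resource attributes (the real SDKs also handle URL-decoding and schema details; OTEL_SERVICE_NAME taking precedence over a service.name entry mirrors the spec):

```python
import os

# Simulate the environment from the examples above:
os.environ["OTEL_RESOURCE_ATTRIBUTES"] = \
    "deployment.environment=production,service.version=v1.42.3"
os.environ["OTEL_SERVICE_NAME"] = "orders-service"

def parse_resource_attributes() -> dict:
    """Simplified parse of the comma-separated key=value list."""
    raw = os.environ.get("OTEL_RESOURCE_ATTRIBUTES", "")
    attrs = dict(pair.split("=", 1) for pair in raw.split(",") if "=" in pair)
    # OTEL_SERVICE_NAME wins over any service.name entry in the list
    if "OTEL_SERVICE_NAME" in os.environ:
        attrs["service.name"] = os.environ["OTEL_SERVICE_NAME"]
    return attrs

print(parse_resource_attributes())
```

Setting OTEL_SERVICE_NAME explicitly is what keeps you out of the unknown_service bucket.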
Context propagation across protocols
OTel propagates trace context automatically for HTTP (both sides), gRPC, and most messaging systems when you use the auto-instrumentation. The underlying mechanism:
- HTTP: traceparent + tracestate headers (W3C standard).
- gRPC: metadata entries with the same keys.
- Messaging (Kafka, RabbitMQ, SQS): headers / attributes on the message.
If you build a custom protocol or use something exotic, you need to manually inject/extract:
import (
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)
// Send side — inject
propagator := otel.GetTextMapPropagator()
propagator.Inject(ctx, propagation.MapCarrier(outgoingHeaders))
// Receive side — extract
ctx = propagator.Extract(ctx, propagation.MapCarrier(incomingHeaders))
ctx, span := tracer.Start(ctx, "handle_message")
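What actually travels in the traceparent header is a small hyphen-separated value: version, 16-byte trace_id, 8-byte parent span_id, and flags, all hex-encoded. A simplified build/parse sketch (real propagators do stricter validation, and the sampled flag is a bit field rather than a string compare):

```python
# Sketch of the W3C traceparent header the propagator injects/extracts:
#   "00-<32 hex trace_id>-<16 hex span_id>-<2 hex flags>"
def build_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def parse_traceparent(header: str):
    version, trace_id, span_id, flags = header.split("-")
    # Simplified validation; real propagators also reject all-zero IDs.
    assert version == "00" and len(trace_id) == 32 and len(span_id) == 16
    return trace_id, span_id, flags == "01"

hdr = build_traceparent(
    "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", True)
print(hdr)  # 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```

This is why context propagation works across vendors: the header format is a W3C standard, not an OTel invention.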
The metrics story
OTel metrics mirror the Prometheus data model — counters, gauges, and histograms — with explicit-bucket histograms covering the territory Prometheus summaries do:
import (
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

meter := otel.Meter("my-service")
requestCounter, _ := meter.Int64Counter("http.server.requests",
metric.WithDescription("Total HTTP requests"),
)
requestDuration, _ := meter.Float64Histogram("http.server.request.duration",
metric.WithUnit("s"),
metric.WithExplicitBucketBoundaries(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5),
)
// In a handler
requestCounter.Add(ctx, 1,
metric.WithAttributes(attribute.String("http.route", "/orders")),
)
requestDuration.Record(ctx, duration.Seconds(),
metric.WithAttributes(attribute.String("http.route", "/orders")),
)
These can be exported to Prometheus (via the collector's Prometheus exporter), to OTLP (to Tempo / Datadog / Honeycomb), or both.
If you are starting a new project today, use OTel metrics, not a native Prometheus client library. You still get Prometheus as the backend — but your code is vendor-neutral.
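What an explicit-bucket histogram instrument does under the hood is simple: each measurement increments the count of the first bucket whose upper boundary is at or above the value, plus one overflow bucket. A hand-rolled sketch (not the real SDK aggregation, which also tracks min/max and exemplars):

```python
import bisect

# Sketch of explicit-bucket histogram aggregation, using the boundaries
# from the Go example above. Bucket i covers (bounds[i-1], bounds[i]];
# the last slot is the overflow bucket for values above the top boundary.
BOUNDS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5]

class Histogram:
    def __init__(self, bounds):
        self.bounds = bounds
        self.counts = [0] * (len(bounds) + 1)  # +1 overflow bucket
        self.total = 0.0

    def record(self, value: float):
        # bisect_left gives upper-bound-inclusive buckets, matching OTel
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value

h = Histogram(BOUNDS)
for latency in (0.004, 0.03, 0.03, 7.0):
    h.record(latency)
print(h.counts)  # [1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 1]
```

Because only bucket counts and a running sum are kept, cardinality stays fixed no matter how many measurements you record.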
What breaks when you get this wrong
The most common OTel pitfalls:
- service.name not set → everything groups as unknown_service; dashboards are useless.
- Auto-instrumentation not enabled → traces are empty or spotty.
- BatchSpanProcessor with tiny timeout → spans get dropped under load.
- Context not propagated → each service produces its own disconnected traces instead of one joined trace. Symptom: traces show only one service's spans.
- Too many custom attributes → cardinality explosion in the backend (same problem as metrics labels).
- Sampler set to AlwaysOn in production → crushing volume + cost.
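The sampler the end-to-end setup below uses, parentbased_traceidratio, can be sketched in a few lines: honor the parent's decision when there is one, and for root spans decide deterministically from the trace_id so every service agrees (this is a rough model; real SDK implementations differ in which bits they compare):

```python
from typing import Optional

# Rough sketch of parentbased_traceidratio with a 10% ratio.
RATIO = 0.1
BOUND = int(RATIO * (1 << 64))  # threshold over the 64-bit comparison space

def should_sample(trace_id: str, parent_sampled: Optional[bool]) -> bool:
    if parent_sampled is not None:
        return parent_sampled          # parent-based: follow the parent
    # Root span: compare the low 64 bits of the trace_id to the threshold.
    # Deterministic in the trace_id, so no cross-service coordination needed.
    return int(trace_id[16:], 16) < BOUND

tid = "4bf92f3577b34da6a3ce929d0e0e4736"
print(should_sample(tid, parent_sampled=None))   # deterministic per trace_id
print(should_sample(tid, parent_sampled=True))   # child of a sampled parent
```

The parent-based wrapper is what keeps traces whole: either every span of a trace is kept or none is.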
A minimal end-to-end setup
Everything you need to get tracing working in a new service:
# 1. Install the SDK
npm install @opentelemetry/api @opentelemetry/sdk-node \
@opentelemetry/auto-instrumentations-node \
@opentelemetry/exporter-trace-otlp-http
# 2. Set env vars
export OTEL_SERVICE_NAME=my-service
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector.observability.svc:4318
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1
export OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production
# 3. Start the app with the instrumenting require
node --require @opentelemetry/auto-instrumentations-node/register app.js
No code changes. Traces flow to the collector. Collector exports to your backend. Done.
Quiz
Your team is picking a tracing backend. Some engineers want Jaeger (self-hosted), others want Datadog. You are worried about being locked in. Which of these is the right architectural choice?
What to take away
- OTel is three things: APIs (what your code calls), SDKs (the in-process machinery), and the Collector (the standalone process that translates to backends).
- Start with auto-instrumentation. Add manual spans only for business operations.
- Use semantic conventions — http.request.method, not your own attribute names.
- The OTel Collector is the decoupling point. Change backends without changing code.
- Always set service.name. Always propagate context. Sample thoughtfully.
- Prefer OTel metrics for new code — same Prometheus experience, vendor-neutral.
Next lesson: tracing in practice — sampling strategies at scale, cost, and the backend options that matter.