Observability Fundamentals for Engineers

Log Levels and What They Mean

Nearly every logging library ships the same handful of levels: TRACE, DEBUG, INFO, WARN, ERROR, FATAL. Every team uses them slightly differently, and most teams misuse them badly enough that their logs are either drowning in noise or silent about the things that matter.

This lesson is about what each level is supposed to mean, the mistakes engineers make at each level, and how to pick the right level so future-you can filter sanely during an incident.

KEY CONCEPT

Log levels are contracts with your future on-call self. ERROR means wake someone up. WARN means read this next time someone opens the logs. INFO means keep this forever. Once teams agree on what each level means, filtering becomes useful.


The standard levels

From least to most severe:

| Level | Meaning | Production guidance |
|-------|---------|---------------------|
| TRACE | Very fine-grained — function entry/exit, tight loops | Almost always disabled in production. Use distributed tracing instead. |
| DEBUG | Diagnostic detail — intermediate values, state transitions | Disabled in prod by default; enabled temporarily when debugging a live issue. |
| INFO | Business events — requests served, transactions processed | Enabled in prod. This is your audit trail; it should always tell a story. |
| WARN | Something unusual, not (yet) broken — retry succeeded, fallback triggered | Investigate during business hours, not at 2am. |
| ERROR | A thing that should have worked, did not — request failed, write rejected | May or may not page — depends on rate and impact. |

FATAL exists in some libraries (panic-level errors that crash the process). You will rarely log at this level explicitly — a FATAL entry is almost always the stack trace of an unhandled exception or panic, emitted by the runtime.


DEBUG — what it is actually for

DEBUG is what you enable temporarily to diagnose a specific issue in production. It is the level where you log things like:

  • The SQL query that was executed and its parameters.
  • The decision branches taken by a complex piece of business logic.
  • The contents of a cache key lookup.
  • The raw response from an external API before parsing.
slog.Debug("cache lookup",
    "key", cacheKey,
    "hit", hit,
    "ttl_remaining", ttlRemaining,
)

Disabled by default. Enabled via a runtime flag (environment variable, feature flag, or dynamic config) without a restart.

WARNING

Do not use DEBUG for things that should always be visible. If you find yourself thinking "let me add a debug log so I can see it if anything breaks," that is actually an INFO log.

The prod DEBUG pattern

Most mature teams support enabling DEBUG for a subset of requests — typically by adding a header or feature flag. This keeps logs quiet at scale while still giving engineers a way to see what is happening for a specific request.

if r.Header.Get("X-Debug-Log") == debugToken {
    ctx = context.WithValue(ctx, logLevelKey, slog.LevelDebug)
}

INFO — the business event level

INFO is the audit trail of what your service did. Every INFO log should represent a meaningful business event:

  • A request was handled.
  • A job was started or completed.
  • A user was authenticated.
  • A payment was processed.
  • A deployment finished.
slog.Info("request completed",
    "path", r.URL.Path,
    "method", r.Method,
    "status", rw.StatusCode,
    "duration_ms", dur.Milliseconds(),
)

slog.Info("order placed",
    "order_id", orderID,
    "user_id", userID,
    "amount_cents", amount,
)

If you look at your INFO logs for the last hour, you should be able to reconstruct what the service did. If you cannot, you are either logging too little at INFO or too much non-business chatter.

PRO TIP

A useful self-check: pick a random INFO log line and ask "would this help an engineer six months from now understand what happened?" If the answer is no, demote it to DEBUG or delete it.


The INFO volume problem

Most teams log too much at INFO. Common offenders:

  • "Entering function X" — use tracing, not logs.
  • "Config value Y is Z" at startup, logged once per module — consolidate into a single structured startup log.
  • "Cache hit for key K" on every request — this is metrics territory.
  • "Heartbeat / tick" from a loop — this is DEBUG at best, or just a metric.

The test: would I want to keep this for 90 days at ~1 KB per line across 1000 pods?

WARNING

INFO log volume has a direct dollar cost. Most managed log platforms charge by GB ingested. 1000 pods × 100 log lines/second × 1KB × 86400s = 8.6 TB/day. At typical cloud pricing, that is roughly $3,000/day just to store INFO logs. Cut it in half by dropping useless lines and you save $500k/year.
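Spelled out, the arithmetic looks like this. The $0.35/GB ingest price is an assumed figure chosen to match the rough numbers above; check your provider's actual rate.

```go
package main

import "fmt"

func main() {
	const (
		pods          = 1000
		linesPerSec   = 100
		bytesPerLine  = 1000 // ~1 KB
		secondsPerDay = 86400
		dollarsPerGB  = 0.35 // assumed ingest price, not a quoted rate
	)
	bytesPerDay := float64(pods*linesPerSec*bytesPerLine) * secondsPerDay
	costPerDay := bytesPerDay / 1e9 * dollarsPerGB

	fmt.Printf("%.2f TB/day\n", bytesPerDay/1e12)
	fmt.Printf("$%.0f/day; halving INFO saves $%.0f/year\n",
		costPerDay, costPerDay/2*365)
}
```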


WARN — the "someone should look at this" level

WARN means "something unusual happened, but the system handled it." A WARN log is not an incident — it is a signal that something is trending in the wrong direction.

Good WARN examples:

  • Retry succeeded after one or more failures.
  • Fallback to a secondary service because the primary timed out.
  • Deprecated API endpoint was called.
  • Rate limit exceeded for a client (throttle kicked in, request eventually succeeded).
  • Config value is close to a limit ("connection pool 90% full").
slog.Warn("retry succeeded",
    "attempt", attempt,
    "total_attempts", maxRetries,
    "upstream", "payments-service",
)

slog.Warn("fallback triggered",
    "primary", "ml-recommender",
    "fallback", "popularity-based",
    "reason", "primary_timeout",
)
KEY CONCEPT

WARN is the level engineers most often skip. They jump from INFO to ERROR. But the bugs that show up in postmortems often have a trail of WARNs leading up to them — retries, fallbacks, degraded paths. Do not underuse this level.

When NOT to use WARN

  • A user did something wrong (sent invalid input). That is an INFO or DEBUG — it is expected. Use a metric like http_requests_total{status="400"} to track volume.
  • A feature flag evaluated to "off". That is just a business event, INFO at most.
  • A 404 happened. Same as above.

If the log would read "a user did something we reject by design," it is INFO or a metric, not WARN. WARN is for things that surprise your code, not for users misbehaving in expected ways.


ERROR — the "something broke" level

ERROR means a thing that should have worked did not. A request failed, a write was rejected, a downstream call returned 5xx, a background job exhausted retries.

slog.Error("db query failed",
    "query", "INSERT INTO orders",
    "err", err,
    "duration_ms", dur.Milliseconds(),
)

Rules for ERROR

  1. An ERROR represents a failure. Not an unusual condition, not an edge case. A thing that should have worked did not.
  2. An ERROR should be actionable. Either a human needs to look at it, or a system needs to be fixed. "Error 42 from API" with no other context is worse than nothing.
  3. Include the full error chain. Errors lose context every time they cross a layer without being wrapped. Log the outermost error, but make sure it preserves the root cause.
  4. Log once per failure, at the boundary. Do not log "error" at every layer of the stack as it unwinds. Log it once, as high up as possible, where you have the full context.
// WRONG — logs the same error 3 times
func inner() error {
    err := callSomething()
    if err != nil {
        slog.Error("inner failed", "err", err)
        return err
    }
    return nil
}
func middle() error {
    err := inner()
    if err != nil {
        slog.Error("middle failed", "err", err)
        return err
    }
    return nil
}
func outer() error {
    err := middle()
    if err != nil {
        slog.Error("outer failed", "err", err)
        return err
    }
    return nil
}

// RIGHT — log once, at the boundary, with wrapped context
func inner() error {
    err := callSomething()
    if err != nil {
        return fmt.Errorf("call something: %w", err)
    }
    return nil
}
func middle() error {
    err := inner()
    if err != nil {
        return fmt.Errorf("middle process: %w", err)
    }
    return nil
}
func outer() {
    err := middle()
    if err != nil {
        slog.Error("outer process failed", "err", err)
        return
    }
}
PRO TIP

The rule is: return errors up the stack with context added; log them once at the outermost layer where they become terminal (a request handler, a job runner, a main function).


The "ERROR does not mean alert" rule

This is where teams most commonly get log levels wrong: they set up alerts on count(level="error") and then wonder why they have alert fatigue.

# BAD ALERT — flaps constantly
sum(rate(log_entries{level="error"}[5m])) > 0

A single ERROR per hour is probably fine. 100 per second is probably bad. Alerts should be on rates and ratios, not presence.

# BETTER — error rate relative to traffic
sum(rate(log_entries{level="error"}[5m]))
  /
sum(rate(http_requests_total[5m]))
  > 0.01

Even better: alert on the symptom (request error rate, latency p99) — not on the log level. We will cover this in Module 6.


Level mapping cheat sheet

| Situation | Level |
|-----------|-------|
| Function entered | TRACE (or don't log — use tracing) |
| SQL query executed | DEBUG |
| Cache hit/miss | DEBUG (or metric) |
| Request handled successfully | INFO |
| Background job completed | INFO |
| Deployment started / succeeded | INFO |
| User sent invalid input (400) | INFO (or metric) |
| Retry was needed but succeeded | WARN |
| Fell back to secondary path | WARN |
| Config value approaching limit | WARN |
| Deprecated API called | WARN |
| A 5xx returned by your service | ERROR |
| A downstream call failed permanently | ERROR |
| Database write rejected | ERROR |
| Background job exhausted retries | ERROR |
| Panic / unrecoverable process state | FATAL |

Log levels and sampling

As volume grows, you may need to sample some levels. A reasonable pattern:

  • DEBUG: never shipped to central store; only to stdout for the specific pod being debugged.
  • INFO: 100% for now; consider head sampling (e.g. 10% of successful requests) if volume spikes.
  • WARN / ERROR: always 100%. Never sample errors.
WARNING

Never sample ERROR logs. If you miss the one error that mattered, you will never know. Sample INFO; keep WARN and ERROR complete.


Quiz

KNOWLEDGE CHECK

Your service calls a flaky recommendation API. If the first call fails, your code automatically retries up to 3 times. The second attempt succeeds. What level should you log this at?


What to take away

  • Log levels are a contract: everyone on the team needs to agree on what each level means.
  • DEBUG: diagnostic detail, disabled in prod, enabled selectively.
  • INFO: the audit trail of business events. Keep this story-readable.
  • WARN: unusual but handled. The level most teams underuse.
  • ERROR: a thing that should have worked, did not. Log once per failure, at the boundary.
  • Do not alert on level=error. Alert on symptoms (error rate, latency p99).
  • Never sample ERROR; consider sampling INFO when volume grows.

Next lesson: log aggregation, retention, and the cost controls that keep your log bill from doubling every quarter.