Log Aggregation and Cost
If you work at a company spending more than a few thousand dollars a month on observability, it is almost certainly logs — not metrics, not traces — that dominate the bill. Log volume grows with traffic, log volume grows with pods, log volume grows with every engineer who adds an INFO line because "it might be useful later."
This lesson is about the aggregation pipeline, the sampling strategies that keep cost under control, and the retention tiers that let you keep what matters without paying for what you will never look at again.
Log cost is not an infrastructure problem — it is an engineering discipline problem. Every INFO log that runs on every request is a recurring monthly bill until someone removes it. Treat log lines like database rows: write them deliberately.
The log aggregation pipeline
Every production logging system has four stages:
1. The app
Applications write JSON to stdout. That is it. They do not ship logs themselves. They do not write to files. They do not block on the log pipeline being reachable.
Never let logging block the critical path. A logger that does synchronous network I/O will freeze your service when the log endpoint is down. All production loggers buffer locally and flush asynchronously.
2. The collector
A DaemonSet on every node (typically Vector, Fluent Bit, or the OTel collector). It tails the stdout of every pod, parses JSON, adds Kubernetes metadata (pod, namespace, node, labels), optionally filters or samples, and ships onward.
The collector is where you implement sampling and drop rules — before data crosses the wire to the expensive storage tier.
3. The buffer
A durable queue — usually Kafka, Kinesis, or S3-as-buffer — that absorbs traffic spikes and decouples collection from storage. Skippable for small deployments; critical above roughly 1M logs/minute.
4. Storage and query
Where logs are indexed and searchable. Options:
- Loki — index only labels, full-text on-the-fly. Cheap for high volumes.
- Elasticsearch / OpenSearch — index everything. Fast queries, expensive storage.
- Splunk / Datadog / Sumo Logic — managed, feature-rich, most expensive per GB.
- S3 + Athena / ClickHouse — DIY, cheap, slower queries, good for cold data.
Why log volume dominates the bill
A rough cost model for a mid-sized production service:
```
  1,000 pods
× 50 log lines per second per pod (INFO + WARN + ERROR)
× 1 KB per line (structured JSON)
= 50 MB/s ingested
= ~4.3 TB/day
= ~130 TB/month
```

At $2/GB ingested (typical managed pricing), that is roughly $260,000/month.
Compare to metrics for the same fleet:
```
1,000 pods × 2,000 active series × ~$0.005 per series/month = $10,000/month
```
Logs are 26× the cost of metrics for the same fleet, and that is with reasonable per-pod volume. The first time you see this math, it changes how you think about log lines.
The single biggest cost control in observability is being deliberate about which INFO logs you write. A 1 KB log line emitted on every request, at even 100 requests per second fleet-wide, is roughly 260 GB a month, about $500 at $2/GB. That INFO you added "just in case" is real money.
Sampling strategies
When volume grows past your budget, the options are to drop, sample, or compress.
Head sampling — decide at generation time
The service decides whether to log before it emits the line. Simplest to reason about, but loses data you cannot get back.
```go
// Log only 10% of successful requests, all errors
if rng.Float64() < 0.1 || status >= 500 {
	slog.Info("request completed", ...)
}
```
Tail sampling — decide at the collector
The collector sees all logs and decides which to ship. More expensive (you still pay for collection) but lets you sample based on context — e.g. always keep the full trace if any log in the trace was an error.
```toml
# Vector config — keep all errors + 10% of everything else
[transforms.sample]
type = "sample"
inputs = ["parse"]
rate = 10                      # keep 1 in 10
exclude = '.level == "error"'  # errors bypass sampling entirely
```
Rate-based sampling — cap emissions per second
Useful for preventing a runaway log loop from costing $50,000 in an hour.
```go
// rate.NewLimiter is from golang.org/x/time/rate
var limiter = rate.NewLimiter(100, 200) // 100 logs/sec, burst of 200

if limiter.Allow() {
	slog.Info(...)
}
```
Aggregation — count, don't log
If you are logging the same event thousands of times a second, you probably want a metric, not a log:
```go
// WRONG — log every cache hit
slog.Info("cache hit", "key", key)

// RIGHT — increment a counter
cacheHits.Inc()
```
Retention tiers
Not every log needs to be queryable for 90 days. Tier your storage:
| Tier | Duration | Cost | What it's for |
|---|---|---|---|
| Hot | 3-7 days | $$$ | Incident response, real-time queries |
| Warm | 7-30 days | $$ | Week-over-week comparison, weekly reviews |
| Cold | 30-365 days | $ | Compliance, audits, post-mortems |
| Archive | 1+ year | ¢ | Legal hold, rare historical queries |
Most engineers query the last 24 hours 95% of the time. The last 7 days 99% of the time. Anything older is rare enough that slower queries against cheaper storage are fine.
A good first move when your log bill is spiralling: cut hot retention to 3 days, move everything else to S3-backed cold storage. Your query latency goes up 10×, your bill goes down 80%, and you almost never notice because you almost never query cold logs.
What to drop at the collector
Collectors let you drop logs before they hit expensive storage. Things worth dropping unconditionally:
- Health check requests — `GET /healthz`, `GET /ready`, `GET /metrics`. Noise.
- Bot traffic — `GoogleBot`, `AhrefsBot`, etc. Useful as metrics, not as individual logs.
- Known harmless errors — client disconnects, connection reset errors from load balancer probes.
- Duplicate / noisy sidecars — `istio-proxy` logs, `linkerd-proxy` logs.
```toml
# Vector — drop health check noise
[transforms.filter]
type = "filter"
inputs = ["parse"]
condition = '!includes(["/healthz", "/ready", "/metrics"], .path)'
```
The Loki-style label cardinality trap
If you use Loki (or other label-indexed log stores), labels have the same cardinality problem as Prometheus metrics.
```logql
# WRONG — request_id as a label = one stream per request
{service="api", request_id="abc123"}

# RIGHT — request_id in the log body, not in labels
{service="api"} | json | request_id="abc123"
```
Loki labels should be like Prometheus labels: service, namespace, pod, level. High-cardinality fields (trace_id, user_id, request_id) go in the log body and are queried with filtering.
The same rule as metrics: labels are for bounded, enumerable values. Unbounded values go in the body. Violating this turns Loki into a list of millions of tiny, useless streams.
The PII and secrets problem
Logs end up in databases that dozens of people have access to. They get shipped to third-party vendors. They are retained for compliance periods measured in months.
Assume every log line is public. Anything you would not want to show up in a Datadog dashboard or a compliance audit needs to be redacted at the source, not in post-processing.
```go
// Go — redact at construction
func logSafeUser(u User) slog.Attr {
	return slog.Group("user",
		"id", u.ID,     // OK — opaque identifier
		"tier", u.Tier, // OK — bounded value
		// deliberately omitting email, name, phone, address
	)
}
```
Collector-level redaction (with regex for common patterns) is a useful second line, but it is a safety net, not the primary control.
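As an illustration of that second line of defense, a minimal regex pass; the two patterns here are examples only, and real deployments run a much larger pattern set at the collector:

```go
package main

import (
	"fmt"
	"regexp"
)

// Regex redaction is best-effort: it catches obvious shapes and misses
// everything else, which is why it is a safety net, not the primary control.
var (
	emailRe = regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`)
	ssnRe   = regexp.MustCompile(`\b\d{3}-\d{2}-\d{4}\b`)
)

func redact(line string) string {
	line = emailRe.ReplaceAllString(line, "[EMAIL]")
	return ssnRe.ReplaceAllString(line, "[SSN]")
}

func main() {
	fmt.Println(redact(`{"msg":"signup","email":"jane@example.com"}`))
	// prints {"msg":"signup","email":"[EMAIL]"}
}
```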
The 90/10 rule
90% of log usage is looking at the last 24 hours. 10% is everything else. Design your pipeline around that:
- Hot tier (last 24-72h): fast search, indexed, expensive. Small.
- Warm tier (last 30d): searchable but slower, cheaper storage. Medium.
- Cold tier (90d+): S3 or equivalent. Queried via Athena / Loki compactor. Large but cheap per GB.
Most log budgets go wrong by keeping everything at hot-tier pricing for 90 days because no one set up the tiering.
War story — the 8× bill jump
A team I worked with had a stable logging bill of ~$20k/month. Over one quarter it grew to $160k. Nothing had changed in volume — or so they thought.
The culprit: one engineer had added a new INFO log in a hot code path. The line included the full request body in the `raw` field, so each line was now ~30 KB instead of ~1 KB, multiplied by 3,000 RPS across 500 pods and retained for 90 days in the hot tier.
We caught it by running this query:
```promql
topk(10,
  sum by (service) (rate(log_entries_bytes[1h]))
)
```
Once identified, the fix was two lines: redact `raw` down to its length and content type. The bill fell back to $22k.
Observe your observability. Track log volume as a metric per service. Alert when a service's log volume grows more than 2× week-over-week.
Quiz
Your team is alarmed by the log bill — it has grown from $10k/month to $80k/month in 6 months. The engineering headcount has not changed much. What is the most likely cause, and what should you check first?
What to take away
- Logs are typically 10-25× more expensive than metrics for the same fleet. Treat every INFO line like recurring infrastructure cost.
- The pipeline: app writes JSON to stdout → DaemonSet collector → buffer (optional) → storage. Never let logs block the critical path.
- Sampling strategies: head (cheap, loses data), tail (expensive, smart), rate-limited (prevents runaway cost), aggregation (replace with a metric).
- Retention tiers: hot (days), warm (weeks), cold (months), archive (years). Most queries hit the last 24 hours.
- Drop noise at the collector: health checks, bots, harmless errors.
- Observe your observability: track log volume per service and alert on 2× growth.
Next module: distributed tracing — the pillar that makes the metrics → logs → traces flow complete.