etcd Operations Masterclass

The Data Model

etcd looks simple from the outside: key → value. And that much is true. What makes it genuinely interesting is the surrounding machinery — revisions, leases, watches, transactions — that turns "key-value store" into "coordination primitive for distributed systems."

This lesson covers what etcd's data model actually is, the five concepts you have to understand (keys, values, revisions, leases, watches), how each one maps to Kubernetes behavior, and the specific reasons etcd is NOT a suitable substitute for Postgres.

KEY CONCEPT

etcd's data model is a key-value store with a global revision number. Every change increments the revision. Every key remembers its history. Every client can watch any key and receive events for changes. These four primitives are everything etcd offers — and everything Kubernetes needs.


Keys and values — the basics

A key is a byte string. A value is a byte string. That's it at the data level.

etcdctl put /my/key "my value"
etcdctl get /my/key

Output:

/my/key
my value

Keys are typically structured as paths with / separators — not because etcd cares about hierarchy, but because etcd supports efficient range queries over lexicographically sorted keys.

# Get everything under /registry/pods/
etcdctl get --prefix /registry/pods/

The prefix scan returns every key that starts with the given prefix, in sorted order. This is how Kubernetes lists all pods, all services, all ConfigMaps in a namespace — the prefix determines the category.
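
Because the keyspace is sorted, a prefix scan is just a range read from the prefix to its next lexicographic sibling. A toy sketch of the idea in Python (a sorted list standing in for etcd's key index; the real range-end computation also handles keys ending in 0xff, which this sketch ignores):

```python
import bisect

def prefix_range_end(prefix: str) -> str:
    # etcd derives the range end by incrementing the last byte of the
    # prefix: every key in [prefix, range_end) shares the prefix.
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)

def get_prefix(sorted_keys: list[str], prefix: str) -> list[str]:
    # Binary-search the sorted keyspace for the half-open range.
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, prefix_range_end(prefix))
    return sorted_keys[lo:hi]

keys = sorted([
    "/registry/pods/default/web-1",
    "/registry/pods/default/web-2",
    "/registry/pods/kube-system/dns",
    "/registry/services/default/web",
])

# Every key under /registry/pods/, in sorted order; services excluded.
print(get_prefix(keys, "/registry/pods/"))
```

Two binary searches bound the whole scan, which is why prefix listing stays cheap even with a large keyspace.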


Values: size and format

Values are arbitrary bytes, but they have a size limit. By default a whole request is capped at 1.5 MB (configurable via --max-request-bytes), which effectively bounds value size.

etcd doesn't care what's in the value. For Kubernetes, values are protobuf-encoded (covered in lesson 1.3). For other applications, they might be JSON, plain text, or encrypted blobs.

Keys count against the same request limit and should stay small — well under 1 KB in practice. Put structured data in the value, not the key.

WARNING

etcd is not optimized for large values. Storing 1 MB Secrets works but is a yellow flag; storing many-MB blobs via workarounds is a red flag. If your workload involves large values, the answer is object storage (S3, MinIO) with metadata in etcd — not stuffing blobs in etcd.


Revisions — the global clock

Every change to any key in etcd gets a global, monotonically-increasing revision number. This is a single integer that counts total edits across the entire key space.

etcdctl put /foo "first"
# OK — revision might be, say, 42

etcdctl put /bar "something"
# OK — revision 43

etcdctl put /foo "second"
# OK — revision 44 (note: same key, new revision)

Every put, delete, and transaction has a revision. Even operations on completely unrelated keys advance the same counter.

Why this matters

The global revision is the coordination primitive. It lets you:

  • Watch from a specific point: "tell me everything that changed after revision X."
  • Read historical state: "what did /foo hold at revision 42?"
  • Compare-and-swap: "update /foo only if its current revision equals X."
  • Optimistic concurrency: "check the revision when you read; write only if it hasn't moved."

Per-key metadata

Each key has four associated revision numbers:

  • create_revision — revision at which this key was first created.
  • mod_revision — revision at which this key was last modified.
  • version — how many times this key has been modified (per-key counter).
  • lease — the lease ID attached to this key (0 if no lease).

Run etcdctl get /foo -w json to see all of them (in the real output, key and value are base64-encoded and wrapped in a kvs array; shown decoded here for readability):

{
  "key": "/foo",
  "create_revision": 42,
  "mod_revision": 44,
  "version": 2,
  "value": "second"
}

MVCC — multi-version concurrency control

etcd keeps old revisions around. You can read historical state:

etcdctl get /foo --rev=42
# Returns: first

This isn't a side effect — it's the primary mechanism that lets watchers catch up from any point. It also powers consistent snapshots for backups.

The downside: the database grows. etcd periodically compacts old revisions to keep the DB from growing forever. Compaction is a key concept for production etcd — covered in lesson 5.1.
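
The revision machinery can be condensed into a toy sketch. This in-memory model (illustrative only — nothing like etcd's actual storage engine) keeps every (revision, value) pair per key, which is exactly what makes --rev reads possible and compaction necessary:

```python
class ToyMVCC:
    def __init__(self):
        self.revision = 0      # global, cluster-wide counter
        self.history = {}      # key -> list of (rev, value)

    def put(self, key, value):
        self.revision += 1     # every write bumps the global clock
        self.history.setdefault(key, []).append((self.revision, value))
        return self.revision

    def get(self, key, rev=None):
        if rev is None:
            rev = self.revision
        # Latest write at or before the requested revision.
        for r, v in reversed(self.history.get(key, [])):
            if r <= rev:
                return v
        return None

    def meta(self, key):
        versions = self.history[key]
        return {
            "create_revision": versions[0][0],
            "mod_revision": versions[-1][0],
            "version": len(versions),
        }

s = ToyMVCC()
s.put("/foo", "first")      # revision 1
s.put("/bar", "something")  # revision 2 — unrelated key, same counter
s.put("/foo", "second")     # revision 3

print(s.get("/foo"))          # latest value
print(s.get("/foo", rev=1))   # historical read, like --rev
print(s.meta("/foo"))
```

Compaction, in this model, would simply drop history entries below a chosen revision — which is also why a watcher can't resume from before the compaction point.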


Leases — keys with TTL

A lease is an etcd object that has a TTL (time-to-live). Keys can be associated with a lease. When the lease expires, all keys attached to it are automatically deleted.

# Create a 30-second lease
etcdctl lease grant 30
# lease 694d63e0... granted with TTL(30s)

# Attach a key to it
etcdctl put /my/session "active" --lease=694d63e0...

# The key auto-deletes after 30 seconds unless the lease is renewed
etcdctl lease keep-alive 694d63e0...
# Renews the lease on an interval for as long as this command runs

Why leases exist

Distributed coordination needs ephemeral state: "I'm alive," "I hold the lock," "I'm the leader of this service." If the process dies, the state must be cleaned up automatically — nobody else knows it's dead.

Leases solve this: the process must actively renew (keep-alive) the lease. If it stops, the TTL runs out, and associated keys vanish.
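
The contract is simple enough to sketch. A toy model (hypothetical, driven by an explicit now parameter instead of wall-clock time) of grant, attach, keep-alive, and expiry:

```python
class ToyLeaseStore:
    def __init__(self):
        self.leases = {}    # lease_id -> expiry time
        self.keys = {}      # key -> (value, lease_id or None)
        self.next_id = 1

    def grant(self, ttl, now):
        lease_id = self.next_id
        self.next_id += 1
        self.leases[lease_id] = now + ttl
        return lease_id

    def put(self, key, value, lease_id=None):
        self.keys[key] = (value, lease_id)

    def keep_alive(self, lease_id, ttl, now):
        # Renewal resets the expiry; a live client calls this periodically.
        self.leases[lease_id] = now + ttl

    def tick(self, now):
        # Expire overdue leases and delete every key attached to them.
        dead = [lid for lid, exp in self.leases.items() if exp <= now]
        for lid in dead:
            del self.leases[lid]
        self.keys = {k: (v, lid) for k, (v, lid) in self.keys.items()
                     if lid is None or lid in self.leases}

store = ToyLeaseStore()
lease = store.grant(ttl=30, now=0)
store.put("/my/session", "active", lease_id=lease)

store.keep_alive(lease, ttl=30, now=20)   # renewed at t=20, expires t=50
store.tick(now=40)
print("/my/session" in store.keys)        # still alive: renewed in time

store.tick(now=60)                        # no renewal since t=20
print("/my/session" in store.keys)        # gone: lease expired
```

The key point the sketch makes concrete: nobody ever deletes the dead process's keys explicitly. The absence of renewals is the deletion signal.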

Kubernetes' use of leases

Every coordination primitive in K8s uses leases under the hood:

  • Node heartbeats: kubelets renew a Lease object every ~10 seconds. If the lease expires, the node is marked NotReady.
  • Leader election for controller-manager, scheduler, and custom controllers: each candidate tries to acquire a lease on a well-known key. The holder renews; if it fails, another candidate takes over.
  • Pod cleanup on dead nodes: the eviction machinery itself uses finalizers and controller logic, but the signal that a node (and its pods) is gone comes from a lease expiring.

Leases are why Kubernetes works when pods crash: the ephemeral state auto-cleans.


Watches — the notification system

A watch is a long-lived subscription to changes on a key or range. When the key changes, etcd pushes events to the watcher.

etcdctl watch /registry/pods/ --prefix
# Watches every pod change in the cluster.
# Prints events as they happen:
#   PUT /registry/pods/default/my-pod
#   DELETE /registry/pods/default/old-pod

The watch protocol

Under the hood, a watch uses a gRPC stream. The client opens a stream, sends a watch request, and receives events. Crucially, you can watch starting from a specific revision:

etcdctl watch --rev=42 /registry/pods/ --prefix

This returns every change from revision 42 onward — including ones that happened before the watch was opened. This is how clients reconnect after a network blip without missing events: save the last-received revision, reconnect, resume from there.
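
Resume-from-revision can be modeled with an event log indexed by revision. A toy sketch (real etcd streams events over a gRPC watch stream and only keeps history back to the last compaction):

```python
class ToyWatchLog:
    def __init__(self):
        self.revision = 0
        self.events = []    # list of (revision, op, key, value)

    def put(self, key, value):
        self.revision += 1
        self.events.append((self.revision, "PUT", key, value))

    def delete(self, key):
        self.revision += 1
        self.events.append((self.revision, "DELETE", key, None))

    def watch(self, prefix, start_rev):
        # Replay everything at or after start_rev; real etcd would then
        # keep streaming live events on the same channel.
        return [e for e in self.events
                if e[0] >= start_rev and e[2].startswith(prefix)]

log = ToyWatchLog()
log.put("/registry/pods/default/a", "v1")   # revision 1
log.put("/registry/pods/default/b", "v1")   # revision 2
log.delete("/registry/pods/default/a")      # revision 3

# A client that disconnected after seeing revision 1 resumes from 2
# and misses nothing:
for rev, op, key, _ in log.watch("/registry/pods/", start_rev=2):
    print(rev, op, key)
```

Saving the last-seen revision is the client's entire reconnection protocol: no diffing, no full re-list, as long as the resume point hasn't been compacted away.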

Kubernetes' use of watches

The Kubernetes API server is an etcd watch client on steroids. The controller-manager, scheduler, kubelets, and every informer in every client library maintain watches on etcd-backed resources:

  • "Tell me about every new pod that lands in my namespace."
  • "Tell me about every node that goes NotReady."
  • "Tell me about every ConfigMap change."

When a controller says "I'll reconcile this," what's actually happening is a watch event pushes the change down, the controller's work queue picks it up, and the reconcile loop runs.

Watches are why Kubernetes feels real-time. Without them, every component would have to poll, making the control plane both slower and a far heavier load on the etcd cluster.

KEY CONCEPT

The single biggest reason etcd is the right store for Kubernetes — and why other databases don't work — is the watch system. Postgres can't push changes to thousands of clients with linearizable semantics. etcd does this as a first-class feature.


Transactions — the atomic building block

etcd supports multi-operation transactions with conditional logic:

IF   (conditions)
THEN (success ops)
ELSE (failure ops)

Example: "create this key only if it doesn't already exist":

etcdctl txn --interactive
compares:
create("/my/key") = "0"

success requests (get, put, del):
put /my/key "new value"

failure requests (get, put, del):
get /my/key

If /my/key doesn't exist, create_revision is 0, and the put runs. Otherwise, the fallback get runs. Atomic.

Why this matters

Transactions are how Kubernetes does optimistic concurrency control:

  1. API server reads object from etcd → gets mod_revision = 42.
  2. Client updates object, sends to API server.
  3. API server sends transaction: "IF mod_revision = 42 THEN put new value ELSE fail."
  4. If another writer got there first, mod_revision is now 43, the transaction fails with a conflict, and the client has to retry with the newer version.

This is the resourceVersion you see on every Kubernetes object — it's the etcd mod_revision. Returning a 409 Conflict on an update? That's an etcd transaction failure bubbling up.
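
Steps 1–4 reduce to a compare-and-swap on mod_revision plus a client retry loop. A toy sketch of that pattern (illustrative only; the real API server issues etcd transactions over gRPC):

```python
class ToyStore:
    def __init__(self):
        self.revision = 0
        self.data = {}      # key -> (value, mod_revision)

    def get(self, key):
        return self.data[key]

    def txn_put_if(self, key, expected_mod_rev, value):
        # IF mod_revision(key) == expected THEN put ELSE fail — atomic.
        _, mod_rev = self.data.get(key, (None, 0))
        if mod_rev != expected_mod_rev:
            return False    # someone wrote first: the 409 Conflict case
        self.revision += 1
        self.data[key] = (value, self.revision)
        return True

def update_with_retry(store, key, mutate):
    while True:
        value, mod_rev = store.get(key)        # read + remember revision
        if store.txn_put_if(key, mod_rev, mutate(value)):
            return                             # committed cleanly

store = ToyStore()
store.txn_put_if("/counter", 0, 0)   # create: mod_rev 0 means "absent"
update_with_retry(store, "/counter", lambda v: v + 1)
update_with_retry(store, "/counter", lambda v: v + 1)
print(store.get("/counter"))         # (value, mod_revision)
```

Note that a lost race is never a correctness problem, only a retry: the loser re-reads the newer value and applies its change on top, which is exactly what kubectl does when an update hits a conflict.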


What etcd is NOT

Given all that power, it's tempting to put other things in etcd. Don't. Here's why:

etcd is not a general-purpose database

  • No SQL. Only key lookup and range scans.
  • No indexes on values. Can only query by key.
  • No joins or relational anything.
  • No full-text search.
  • Values are opaque bytes from etcd's perspective.

For any app that needs relational queries, aggregations, or secondary indexes, use Postgres or another real database.

etcd doesn't scale like a database

The global revision counter and the leader bottleneck mean etcd's write throughput is modest: a few thousand writes per second on fast disks, much less on slow ones.

Reads scale better (followers can serve them serializably), but the whole cluster replicates every write. A workload with millions of writes per day will push etcd hard.

etcd has data size limits

The default database size quota is 2 GB, configurable via --quota-backend-bytes. Raising it is supported up to a documented ceiling of 8 GB; beyond that is risky. Either way, you can't store arbitrary data volumes.

For Kubernetes, this is plenty for tens of thousands of pods. For general-purpose use, it's a hard constraint.

WARNING

A common mistake: "we already run etcd for K8s, let's use the same etcd for our app's state." Do not do this. Your app's writes will compete with kubelet heartbeats, controller updates, and pod lifecycle events. If your app causes etcd issues, your entire Kubernetes cluster goes down. Always run separate etcd clusters per use case.


A mental model for etcd

What etcd is good for:

  • Small, critical metadata (configuration, state, coordination)
  • Linearizable reads + writes (strong consistency across a cluster)
  • Change notifications (the watch feature — push, not poll)
  • Distributed coordination (locks, leader election via leases)
  • Kubernetes' control plane (the primary use case)

What etcd is NOT for:

  • High-volume writes (app write load, metrics, events)
  • Large values (files, images, long logs)
  • Relational / SQL queries (use Postgres)
  • Time-series data (use Prometheus / VictoriaMetrics)
  • Sharing with your app (separate etcd per use case)

A worked example: how K8s creates a pod

Walk through the lifecycle, tying every piece to the data model we just covered:

  1. User runs kubectl apply -f pod.yaml.
  2. API server validates and admits the object. A new object has no meaningful resourceVersion yet; etcd assigns one on commit.
  3. API server puts /registry/pods/default/my-pod in etcd with protobuf-encoded value.
  4. etcd commits: new revision, say 1001. The key's create_revision = 1001, mod_revision = 1001, version = 1.
  5. etcd notifies watchers: the scheduler and controller-manager both get PUT events for this key.
  6. Scheduler picks a node. Sends UPDATE to API server with spec.nodeName set, including resourceVersion: 1001.
  7. API server sends transaction: "IF mod_revision = 1001 THEN put new value."
  8. If no conflict, new revision 1002. Watchers see the UPDATE event.
  9. Kubelet on the target node has a watch running. Sees the PUT event, reconciles, starts the container.
  10. Kubelet updates the Pod status: sets status.phase = Running. Another PUT, another revision.

Every step uses the data-model primitives: key-value puts, transactions with revision conditions, watches for notifications, leases (not shown — the kubelet uses them for node heartbeats).

No SQL. No schema. Just keys, revisions, and watches.
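
The flow above can be compressed into a toy simulation (hypothetical, in-memory, no real etcd or Kubernetes involved) that exercises the same primitives in the same order — a revision-assigning put, watch events, and conditional updates:

```python
class ToyEtcd:
    def __init__(self, start_rev=1000):
        self.revision = start_rev
        self.data = {}       # key -> (value, mod_revision)
        self.events = []     # (revision, op, key) — what watchers would see

    def put(self, key, value):
        self.revision += 1
        self.data[key] = (value, self.revision)
        self.events.append((self.revision, "PUT", key))
        return self.revision

    def txn_put_if(self, key, expected_mod_rev, value):
        # IF mod_revision == expected THEN put ELSE fail.
        if self.data.get(key, (None, 0))[1] != expected_mod_rev:
            return False
        self.put(key, value)
        return True

etcd = ToyEtcd()

# Steps 1-4: API server stores the new pod; etcd assigns the revision.
rev = etcd.put("/registry/pods/default/my-pod", {"phase": "Pending"})

# Steps 5-7: scheduler saw the PUT event, picks a node, updates
# conditionally on the revision it read.
pod, mod_rev = etcd.data["/registry/pods/default/my-pod"]
assert mod_rev == rev
etcd.txn_put_if("/registry/pods/default/my-pod", mod_rev,
                {**pod, "nodeName": "node-1"})

# Steps 9-10: kubelet saw its event, starts the container, reports status.
pod, mod_rev = etcd.data["/registry/pods/default/my-pod"]
etcd.txn_put_if("/registry/pods/default/my-pod", mod_rev,
                {**pod, "phase": "Running"})

print(etcd.data["/registry/pods/default/my-pod"])
```

Three writes, three revisions, three watch events: the entire pod lifecycle, minus the containers.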


Quiz

KNOWLEDGE CHECK

Your team wants to store application configuration in etcd so it's watchable and highly available. The config includes a 4MB JSON file containing all of your company's feature flags, rarely updated. Is this a good use of etcd?


What to take away

  • etcd's data model: keys (byte strings) → values (byte strings), plus a global revision counter, leases, watches, and transactions.
  • Every change gets a unique, monotonically increasing revision. This is the cross-cluster clock.
  • Per-key metadata (create_revision, mod_revision, version) powers optimistic concurrency.
  • Leases give keys TTLs — the foundation for heartbeats, leader election, and auto-cleanup.
  • Watches push change events to clients. Kubernetes' real-time behavior depends entirely on this.
  • Transactions provide compare-and-swap, enabling K8s's resourceVersion conflict detection.
  • etcd is not a general-purpose database. Small, critical metadata only. No SQL, no big values, no sharing with app workloads.

Next lesson: how Kubernetes actually lays out its objects inside etcd — protobuf, the /registry/ hierarchy, and using etcdctl to explore.