etcd Operations Masterclass

etcd as a Distributed Key-Value Store

Every Kubernetes cluster has one piece of infrastructure that, if you lose it, you lose the cluster. Not the API server — you can rebuild that. Not the scheduler — that's stateless. Not even the nodes — K8s can reschedule pods.

It's etcd. Every pod, every ConfigMap, every Secret, every Service, every Deployment — they all live in etcd. The API server is a client of etcd. The scheduler, the controllers, even kubectl all eventually read or write etcd through the API server. Kill etcd and you kill Kubernetes.

This lesson is the foundation: what etcd actually is, how Raft consensus makes a distributed key-value store reliable, and why the "odd number of members" rule isn't arbitrary.

KEY CONCEPT

etcd is a strongly-consistent, highly-available distributed key-value store. Strongly-consistent means every successful write is immediately visible to every read. Highly-available means the cluster keeps working if a minority of members fail. Both properties come from one algorithm: Raft.


What etcd actually is

Strip away the Kubernetes connection and etcd is a small program with a simple job:

  • Accept writes: PUT /some/key = some_value
  • Accept reads: GET /some/key
  • Accept watches: "tell me when /some/key changes"
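On a single machine, those three operations are almost trivial. Here's a toy in-memory sketch (illustrative only — real etcd adds durability, MVCC revisions, and replication on top of this idea):

```python
# Toy single-machine key-value store illustrating etcd's three core
# operations: put, get, and watch. Real etcd layers a WAL, MVCC
# revisions, and Raft replication on top of exactly this interface.
class ToyStore:
    def __init__(self):
        self.data = {}          # key -> value
        self.watchers = {}      # key -> list of callbacks

    def put(self, key, value):
        self.data[key] = value
        # Notify anyone watching this key.
        for callback in self.watchers.get(key, []):
            callback(key, value)

    def get(self, key):
        return self.data.get(key)

    def watch(self, key, callback):
        self.watchers.setdefault(key, []).append(callback)


store = ToyStore()
events = []
store.watch("/some/key", lambda k, v: events.append((k, v)))
store.put("/some/key", "some_value")
print(store.get("/some/key"))  # -> some_value
print(events)                  # -> [('/some/key', 'some_value')]
```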

What makes etcd interesting — and hard — is doing those three things across multiple machines, without ever returning stale or inconsistent data, even when machines crash or the network partitions.

A single-machine database can give you strong consistency easily (one source of truth, one writer). The hard part is doing that with replicas so you don't lose everything when a server dies.


The distributed systems problem

Imagine three machines, each holding a copy of your data. A client writes "key X = 5" to machine A. Before A finishes replicating to B and C, the network partitions A off.

Now:

  • Client 1 (connected to B): reads key X, gets... old value or new value?
  • Client 2 (connected to A): writes key X = 7. Which machine is right?

Without a protocol, you have two versions of truth. When the network heals, which wins? And if clients already read different values, how do you tell them the other value "didn't really happen"?

This is the problem Raft solves.


Raft in one paragraph

KEY CONCEPT

Raft designates one member as the leader. All writes go through the leader. The leader replicates each write to followers and waits for a majority (quorum) to acknowledge before committing. If the leader fails, the remaining members hold an election — the first to get a majority of votes becomes the new leader.

That's the whole algorithm in a paragraph. The rest is edge cases and proofs.

Three properties fall out:

  1. Strong consistency: only the leader writes, and only after a quorum acknowledges. There's no "two writers disagree" scenario.
  2. Availability: as long as a majority of members are up and can talk to each other, the cluster works.
  3. Fault tolerance: a minority of failed or partitioned members never break the cluster.

Raft states: leader, follower, candidate

Every etcd member is always in one of three states:

  • Leader — exactly one at a time. Handles all writes, replicates the log, and sends heartbeats. Steps down if it loses contact with quorum.
  • Follower — the default state. Receives heartbeats, appends log entries, and serves read-only requests. Becomes a candidate if the heartbeat times out.
  • Candidate — a temporary election state. Increments the term, requests votes, and waits for a majority. Wins, loses, or times out and starts a new election.

In a healthy cluster, one member is leader and the rest are followers. Candidates only appear during elections, which should be rare.


Leader election

When a follower doesn't hear from the leader for longer than its election timeout (default ~1 second), it converts to candidate and starts an election:

  1. Increments its current term (a monotonically increasing number).
  2. Votes for itself.
  3. Sends RequestVote RPCs to every other member.
  4. Other members vote yes if (a) they haven't voted this term already and (b) the candidate's log is at least as up-to-date as theirs.
  5. If the candidate gets a majority of votes, it becomes leader.
  6. New leader immediately sends heartbeats to reset everyone's election timers.
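The vote-granting rule in step 4 is the heart of election safety. A minimal sketch of that rule (the state fields here are illustrative, not etcd's actual data structures):

```python
# Sketch of Raft's RequestVote handling (rule 4 above).
# A member grants its vote only if it hasn't already voted this term
# and the candidate's log is at least as up-to-date as its own.
def grant_vote(state, req_term, req_last_log_term, req_last_log_index, candidate_id):
    # Seeing a higher term resets our vote for the new term.
    if req_term > state["current_term"]:
        state["current_term"] = req_term
        state["voted_for"] = None
    if req_term < state["current_term"]:
        return False  # stale candidate from an old term
    if state["voted_for"] not in (None, candidate_id):
        return False  # already voted for someone else this term
    # "At least as up-to-date": compare last log term first, then index.
    log_ok = (req_last_log_term, req_last_log_index) >= (
        state["last_log_term"], state["last_log_index"])
    if log_ok:
        state["voted_for"] = candidate_id
        return True
    return False


follower = {"current_term": 5, "voted_for": None,
            "last_log_term": 5, "last_log_index": 12}
print(grant_vote(follower, 6, 5, 12, "member-b"))  # True: log is up to date
print(grant_vote(follower, 6, 5, 12, "member-c"))  # False: already voted this term
print(grant_vote(follower, 7, 4, 99, "member-d"))  # False: log not up to date, despite higher term
```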

Why terms matter

A term is like a generation number. Every time an election starts, the term increments. Messages carry a term number. If a leader sees a higher term, it knows a newer leader has emerged and steps down.

Terms prevent split-brain during partitions: a partitioned-off old leader eventually rejoins, sees the current term is higher, and converts back to follower before it can do damage.
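The step-down rule is small enough to state in a few lines (again an illustrative sketch, not etcd source):

```python
# Sketch of the term rule: any member that sees a higher term in an
# incoming message adopts that term and steps down to follower --
# including a partitioned-off old leader that just rejoined.
def observe_term(state, message_term):
    if message_term > state["term"]:
        state["term"] = message_term
        state["role"] = "follower"   # even a leader steps down
    return state


old_leader = {"role": "leader", "term": 3}
# After the partition heals, the old leader hears a heartbeat from term 4:
observe_term(old_leader, 4)
print(old_leader)  # -> {'role': 'follower', 'term': 4}
```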

What can go wrong

  • Two candidates, split vote: neither gets a majority. Both wait a randomized timeout, then one tries again.
  • Leader partitioned off, minority side: its members keep heartbeating among themselves but can't reach quorum. They never elect a new leader. The majority side elects one. When partition heals, the old leader sees the new term and steps down.
  • Leader network flake: heartbeats start missing. Followers start elections. If the flake was transient, the old leader may win re-election; if not, a new leader takes over.
WARNING

Leader changes are visible and expensive. Every change requires logging, committing to disk, and triggering a cluster-wide state sync. Frequent leader changes (>1 per minute) indicate serious network or disk problems — covered in the troubleshooting module.


Log replication

Every write to etcd flows through this pipeline:

The client sends PUT /foo = bar to the leader, and then:

  1. The leader appends the entry to its own log.
  2. The leader sends AppendEntries to the followers.
  3. Quorum ACK: a majority (leader included) durably write the entry to their WAL and acknowledge.
  4. Commit: the leader applies the entry to its state machine.

The commit happens only when a majority have durably written the entry. For a 3-member cluster, majority is 2 (leader + 1 follower). For 5 members, majority is 3. Writes that don't reach a majority are never committed and can be lost. This is the durability guarantee.

Key property: a write is only "committed" once a majority has durably written it. This is why the cluster can lose a minority and not lose data — any committed entry is already on enough machines that a majority-forming survivor group has it.

The two-phase commit story

Raft's write commit is effectively a two-phase process:

  1. Log append (phase 1): leader writes to its own log, sends to followers. Each follower appends to its log and responds.
  2. Commit (phase 2): once majority have appended, leader marks the entry committed. Applies it to the state machine (the actual key-value store). Returns success to the client. On next heartbeat, followers learn the commit index has advanced and apply the entries too.
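The two phases reduce to a counting rule at the leader: an entry commits once the leader plus enough followers hold it durably. A simplified sketch:

```python
# Simplified leader-side view of the two-phase pipeline above.
# Phase 1: append locally, collect follower ACKs.
# Phase 2: commit once a majority (leader included) has the entry durably.
def quorum(n_members):
    return n_members // 2 + 1

def try_commit(n_members, follower_acks):
    # The leader's own durable append counts toward the majority.
    durable_copies = 1 + follower_acks
    return durable_copies >= quorum(n_members)


# 3-member cluster: leader + 1 follower ACK is enough (majority = 2).
print(try_commit(3, follower_acks=1))  # True
# 5-member cluster: leader + 1 ACK is not enough (majority = 3).
print(try_commit(5, follower_acks=1))  # False
print(try_commit(5, follower_acks=2))  # True
```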

The time between phase 1 and phase 2 is where the etcd_disk_wal_fsync_duration_seconds metric matters — it's the time to durably write to disk on each member. If that's slow, commits are slow, all writes feel slow.


Why an odd number of members?

This is the question every new etcd operator asks. Why 3, 5, or 7 — not 4 or 6?

The math

Quorum = floor(N/2) + 1, where N is the number of members.

  Members   Quorum   Tolerates failure of
     1         1              0
     2         2              0
     3         2              1
     4         3              1
     5         3              2
     6         4              2
     7         4              3
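The whole table falls out of the quorum formula; you can reproduce it in a few lines:

```python
# Reproduce the quorum table: quorum = floor(N/2) + 1, and the cluster
# tolerates N - quorum failed members.
for n in range(1, 8):
    q = n // 2 + 1
    tolerated = n - q
    print(f"{n} members: quorum {q}, tolerates {tolerated} failure(s)")
```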

Notice: 3 members tolerate 1 failure. 4 members also only tolerate 1 failure. Adding the 4th member gives you no additional fault tolerance — but adds cost, latency (more members to replicate to), and risk of split-brain-ish scenarios during partitions.

The cost of even numbers

For an even number like 4, if you split exactly in half (2-2), neither side has quorum. Both sides are unavailable. With 3 members, an asymmetric split (1-2) guarantees one side has quorum.

In short: even-numbered clusters get all the cost of an extra node with none of the benefit.

The practical sizes

  • 1 member: no fault tolerance. Only for dev/test.
  • 3 members: the production default. Tolerates 1 failure. Cheapest HA config.
  • 5 members: tolerates 2 simultaneous failures. Used for larger clusters or across more availability zones.
  • 7 members: rare. Tolerates 3 failures. Cross-region or extremely mission-critical setups.
  • 9+: diminishing returns. More members = more replication overhead = slower writes. Almost never justified.
PRO TIP

99% of production etcd deployments are 3 or 5 members. Start with 3; move to 5 if you need cross-zone fault tolerance or have observed back-to-back failures that nearly took the cluster down.


Reads in etcd

Writes go through the leader. Reads can be served from followers too — but with a subtlety.

Linearizable reads (default)

The default read path is linearizable: the client sees results that are consistent with a single serial order of operations. This requires the read to be routed to (or coordinated with) the leader.

etcdctl get /foo
# Default: linearizable, requires leader contact

Serializable reads (opt-in)

If you can tolerate slightly-stale data, you can ask for a serializable read that any member can serve without contacting the leader:

etcdctl get --consistency=s /foo
# Serializable: returns whatever this member has locally

Faster (no leader round-trip), but it can return stale data — typically whatever the member last replicated, and arbitrarily stale if that member is partitioned from the leader.

Kubernetes almost always uses linearizable reads because stale data in the scheduler would lead to chaos. You should too, unless you have a specific reason.


The consequences for Kubernetes

Every kubectl apply, every pod creation, every scheduler decision becomes:

  1. API server validates and parses.
  2. API server calls etcd to put /registry/<object path>.
  3. etcd's leader appends to the log.
  4. Leader replicates to followers.
  5. Quorum writes to disk.
  6. Leader commits, returns success to the API server.
  7. API server acknowledges the client.

Latency = fsync latency + network latency × 2 + any queueing.
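Plugging illustrative numbers into that formula makes it concrete (all values below are assumptions for the sake of the arithmetic, not measurements):

```python
# Back-of-envelope write latency using the formula above.
# All numbers are illustrative assumptions, in milliseconds.
fsync_ms = 5.0            # e.g. a decent cloud SSD
network_one_way_ms = 0.5  # leader <-> follower, cross-AZ
queueing_ms = 0.5

# fsync latency + network latency x 2 + any queueing:
write_latency_ms = fsync_ms + network_one_way_ms * 2 + queueing_ms
print(write_latency_ms)  # -> 6.5
```

Swap in EBS-class fsync numbers (say 20ms) and the same write is over 21ms — which is exactly why the control plane feels sluggish on slow disks.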

If etcd is slow, every Kubernetes control-plane operation is slow. kubectl apply takes 5 seconds. Scheduling takes 10 seconds. Controllers can't keep up. The entire cluster becomes sluggish.

This is why the later lessons on disk performance (Module 2) and monitoring (Module 4) matter so much: the cluster's responsiveness is bounded by etcd's write latency.


Network requirements

Because every write requires round-trips to followers before committing, etcd is very sensitive to network latency between members.

Rule of thumb:

  • Same datacenter / same AZ: well under 1ms RTT is normal.
  • Cross-AZ in the same region: 1-5ms RTT, usually fine.
  • Cross-region: 50-200ms RTT, added to every write. Painful.

etcd is not designed for wide-area replication. If you need geographic redundancy, you run separate etcd clusters per region and replicate at a higher level (via Velero backups, GitOps, etc.).

Kubernetes clusters should have etcd members within a single region, ideally within a few AZs.


Durability: the WAL

When a write arrives at a member, the member must durably write it before acknowledging. "Durably" means "survives a power loss."

etcd's Write-Ahead Log (WAL) is an append-only file that every log entry goes into, flushed to disk (fsync) before the member responds to the leader.
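The append-then-fsync discipline is easy to sketch (a toy WAL — etcd's real WAL adds framing, CRCs, and segment files, but the durability rule is the same):

```python
import os
import tempfile

# Toy write-ahead log: append the entry, fsync so it survives a power
# loss, and only then acknowledge. The fsync before returning is the
# entire point -- without it, the ACK would be a lie.
def wal_append(path, entry: bytes):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, entry + b"\n")
        os.fsync(fd)   # durable on disk before we return (ACK)
    finally:
        os.close(fd)


wal_path = os.path.join(tempfile.mkdtemp(), "wal")
wal_append(wal_path, b"PUT /foo = bar")
with open(wal_path, "rb") as f:
    print(f.read())  # -> b'PUT /foo = bar\n'
```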

This is the source of most production etcd pain. Fsync is slow. How slow depends on your disk — which is why Module 2 spends an entire lesson on disk I/O.

Typical fsync latencies:

  • Local NVMe: 0.5-2ms
  • Local SSD: 2-10ms
  • EBS gp3 (AWS): 5-30ms
  • EFS / NFS: 20-200ms (frequently disastrous)

An etcd with fsync above 25ms on average is an etcd about to have problems.


A glimpse into the future of this course

We'll come back to all of this. The foundation:

  • Module 2: how to size and tune etcd for your disk.
  • Module 3: how to back up and restore the thing that holds everything.
  • Module 4: which metrics matter and how to alert.
  • Module 5: the three production emergencies every etcd operator will face.
  • Module 6: upgrades, migrations, and security.

Everything downstream depends on understanding Raft. Knowing why a minority loss is survivable but a majority loss is catastrophic is what makes every troubleshooting decision obvious.


Quiz

KNOWLEDGE CHECK

You have a 3-member etcd cluster. A network partition separates one member from the other two. Which side of the partition continues serving traffic, and why?


What to take away

  • etcd is a strongly-consistent, highly-available distributed key-value store. Every Kubernetes cluster's source of truth.
  • Raft consensus provides both consistency and availability: a single leader, replication to followers, writes committed only with quorum.
  • Odd number of members: 3 or 5 in production. Even numbers add cost with no extra fault tolerance.
  • Writes flow: client → leader → append WAL → replicate to followers → quorum fsync → commit → respond.
  • Reads are linearizable by default; Kubernetes relies on this. Serializable reads are faster but can be stale.
  • Every K8s control-plane operation depends on etcd. Slow etcd = slow cluster.
  • Fsync latency is the dominant write cost. We'll spend Module 2 on this.
  • Leader changes should be rare. Frequent changes = network or disk problems.

Next lesson: the data model — keys, values, revisions, leases, watches — and what Kubernetes does with each.