
etcd Is Slowing Down Your Cluster: Compaction, Defrag, and the 2GB Wall

Your API server latency p99 is rising. etcd disk usage is creeping toward the 2GB quota. Compaction has run, defrag has not, and your cluster is one write spike away from a no-space-left-on-device outage.

By Sharon Sahadevan · 9 min read

You get a Slack ping from monitoring: etcd database size 78% of quota. You ignore it because the cluster is working fine. Two weeks later, the cluster stops accepting writes:

etcdserver: mvcc: database space exceeded

Now kubectl apply fails, deploys are stuck, and your CI pipeline is red across every team. The cluster is read-only because etcd hit its 2GB quota.

This is the etcd quota cliff. It does not slowly degrade; it works perfectly until it stops working completely. The fix is straightforward (compaction + defrag), but if you have not been running them on a schedule, you are walking toward this wall and you do not know when you will hit it.

This post covers etcd storage operations: how revisions accumulate, why compaction is not enough on its own, what defrag does, and the schedule that keeps a Kubernetes cluster healthy indefinitely.

What etcd actually stores

etcd is a key-value store with a multi-version history. Every write creates a new revision; the old version of the key is kept around (so you can run etcdctl get foo --rev=12345 to see what foo was at revision 12345). This is what powers Kubernetes' watch API: clients say "I last saw revision N, tell me what changed since."

The cost: every write grows the database. A pod created and then deleted leaves two revisions in etcd. A ConfigMap updated 100 times leaves 100 revisions. The kube-controller-manager reconciliation loop creates revisions constantly even when nothing visibly changes.

A busy Kubernetes cluster typically generates 50,000 to 500,000 revisions per day. Without compaction, the database grows without bound.
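You can see the revision history directly with etcdctl (revision numbers here are illustrative):

# Each put creates a new revision; old values stay readable by revision
etcdctl put foo bar          # say this lands at revision 100
etcdctl put foo baz          # revision 101
etcdctl get foo --rev=100    # still returns "bar"
etcdctl watch foo --rev=100  # replays every change since revision 100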

The 2GB quota

By default, etcd's database has a 2GB quota (--quota-backend-bytes). When the database hits this size, etcd transitions to read-only mode and refuses all writes. The Kubernetes API server then starts returning errors on every write.

You can raise the quota:

etcd --quota-backend-bytes=8589934592   # 8GB

But this is treating the symptom, not the cause. Without compaction and defrag, an 8GB quota just buys you weeks before you hit it again. The real fix is operating on the database, not raising the cap.
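To see what quota a running member is actually enforcing, check the etcd_server_quota_backend_bytes metric (the same one the alerts later in this post divide by; the etcd:2379 host is a placeholder):

curl -s http://etcd:2379/metrics | grep etcd_server_quota_backend_bytes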

Compaction: deleting old revisions

Compaction tells etcd "any revision before this one can be deleted." After compaction, etcdctl get foo --rev=N for an old N returns "compacted." The space is reclaimable.
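A quick demonstration, with illustrative revision numbers:

etcdctl compact 12345
etcdctl get foo --rev=12000
# Error: etcdserver: mvcc: required revision has been compacted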

Three ways to trigger compaction:

1. Auto-compaction by time (the recommended default):

etcd --auto-compaction-mode=periodic --auto-compaction-retention=8h

With periodic mode, etcd keeps a rolling 8-hour window of history and compacts anything older on a regular cadence; older history is gone.

2. Auto-compaction by revision count:

etcd --auto-compaction-mode=revision --auto-compaction-retention=10000

Keeps the last 10,000 revisions. Older ones get compacted.

3. Manual compaction:

# Get current revision
REV=$(etcdctl endpoint status --write-out=json | jq -r '.[0].Status.header.revision')

# Compact to that revision
etcdctl compact $REV

Useful for one-off cleanup. Not a substitute for auto-compaction.

Important note: kube-apiserver also has its own compaction (--etcd-compaction-interval, default 5 minutes). On modern Kubernetes, the apiserver triggers compaction. You typically do not need to also configure auto-compaction on etcd; configuring both is harmless but redundant.
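A quick way to check, assuming a kubeadm-style control plane (static pod manifest at the kubeadm default path):

grep etcd-compaction-interval /etc/kubernetes/manifests/kube-apiserver.yaml
# No match means the 5-minute default is in effect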

Verify compaction is happening:

# In etcd logs, look for messages like:
# "compacted revision XXXX"

journalctl -u etcd | grep -i compact | tail -5

# Or check the metric
curl -s http://etcd:2379/metrics | grep etcd_debugging_mvcc_db_compaction

Defrag: actually reclaiming the space

Here is the surprise that bites people: compaction does not reduce the on-disk size of the etcd database. Compaction marks old revisions as deletable. The bytes stay on disk as fragmented free space inside the boltdb file.

To actually reclaim space, you need to defrag. Defrag rewrites the database file, packing the live data and freeing the rest.

# Run defrag on each etcd member
etcdctl defrag --endpoints=https://etcd-1:2379

# Or the whole member list via --cluster (etcdctl defrags each member in turn,
# but you give up per-member pacing and health checks)
etcdctl defrag --cluster

Before defrag:

$ etcdctl endpoint status --write-out=table
+----------+--------+--------------+-----+
| ENDPOINT | DBSIZE | DBSIZE_INUSE | ... |
+----------+--------+--------------+-----+
| etcd-1   | 1.8 GB | 200 MB       | ... |
+----------+--------+--------------+-----+

DBSIZE is the file on disk. DBSIZE_INUSE is the live data. The 1.6GB difference is fragmented space that compaction marked as free but defrag has not reclaimed.

After defrag:

+----------+--------+--------------+-----+
| ENDPOINT | DBSIZE | DBSIZE_INUSE | ... |
+----------+--------+--------------+-----+
| etcd-1   | 220 MB | 200 MB       | ... |
+----------+--------+--------------+-----+

This is the operation that actually keeps you off the 2GB cliff.

The defrag gotcha: it pauses the member

Defrag locks the etcd member during the rewrite. The member becomes unavailable for reads and writes until it finishes (which takes seconds for a small database, minutes for a large one).

For a single-node etcd, defrag means downtime. For an HA cluster (3 or 5 etcd members), you can defrag one at a time and the quorum keeps serving:

# Defrag etcd-1 (cluster keeps working via etcd-2, etcd-3)
etcdctl defrag --endpoints=https://etcd-1:2379

# Wait until etcd-1 is back in the cluster

# Then etcd-2
etcdctl defrag --endpoints=https://etcd-2:2379

# Then etcd-3
etcdctl defrag --endpoints=https://etcd-3:2379

Never defrag all members at once. You will lose quorum during the operation.
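A common refinement, not a hard requirement: defrag followers first and the leader last, since pausing the leader can force a leader election. The IS LEADER column shows who currently leads:

etcdctl endpoint status --cluster --write-out=table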

The compaction + defrag schedule

A working operational schedule:

Compaction: continuous (auto-compaction at 8h retention or apiserver's 5-minute interval, whichever you have configured).

Defrag: weekly, rolling across HA members.

#!/bin/bash
# /usr/local/sbin/etcd-defrag.sh
set -euo pipefail

ENDPOINTS=("https://etcd-1:2379" "https://etcd-2:2379" "https://etcd-3:2379")
ETCDCTL_OPTS="--cacert=/etc/etcd/pki/ca.crt --cert=/etc/etcd/pki/client.crt --key=/etc/etcd/pki/client.key"

for endpoint in "${ENDPOINTS[@]}"; do
    echo "Defragging $endpoint..."
    etcdctl $ETCDCTL_OPTS --endpoints="$endpoint" defrag
    
    # Wait for the member to rejoin and stabilize
    sleep 30
    
    # Verify the member is healthy before moving on
    etcdctl $ETCDCTL_OPTS --endpoints="$endpoint" endpoint health
done

echo "Defrag complete."
# /etc/cron.d/etcd-defrag
0 3 * * 0 root /usr/local/sbin/etcd-defrag.sh >> /var/log/etcd-defrag.log 2>&1

Sunday 3 AM, weekly. Adjust based on your write volume; some clusters need it daily, some monthly.

Detecting an imminent quota hit

The two metrics that matter:

# Database size as a fraction of quota
etcd_mvcc_db_total_size_in_bytes
/
on(instance) etcd_server_quota_backend_bytes

# In-use vs total (large gap = needs defrag)
(etcd_mvcc_db_total_size_in_bytes - etcd_mvcc_db_total_size_in_use_in_bytes)
/
etcd_mvcc_db_total_size_in_bytes

Alert rules:

- alert: EtcdDatabaseQuotaWarning
  expr: |
    etcd_mvcc_db_total_size_in_bytes
    / on(instance) etcd_server_quota_backend_bytes > 0.6
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "etcd {{ $labels.instance }} at {{ $value | humanizePercentage }} of quota"

- alert: EtcdDatabaseQuotaCritical
  expr: |
    etcd_mvcc_db_total_size_in_bytes
    / on(instance) etcd_server_quota_backend_bytes > 0.85
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "etcd {{ $labels.instance }} at {{ $value | humanizePercentage }} of quota - quota cliff imminent"

- alert: EtcdDefragNeeded
  expr: |
    (etcd_mvcc_db_total_size_in_bytes - etcd_mvcc_db_total_size_in_use_in_bytes)
    / etcd_mvcc_db_total_size_in_bytes > 0.5
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "etcd {{ $labels.instance }} has > 50% fragmented space; defrag needed"

The defrag alert is the leading indicator. If your weekly defrag is not running, this fires before you get close to the quota.

Recovering from "database space exceeded"

If the cluster has already hit the wall and is read-only:

Step 1: free space immediately. Run compaction once (it is a cluster-wide operation that goes through consensus), then defrag each member in sequence:

# Compact once, from any node
REV=$(etcdctl endpoint status --write-out=json | jq -r '.[0].Status.header.revision')
etcdctl compact $REV

# Then defrag on each etcd node, in sequence
etcdctl defrag

Step 2: clear the alarm. etcd raises a "no space" alarm that persists even after the database shrinks. You must clear it explicitly:

etcdctl alarm list
# NOSPACE alarm should be visible

etcdctl alarm disarm

After this, etcd accepts writes again. The Kubernetes API server resumes operation.
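A quick smoke test that writes are landing again (the ConfigMap name here is arbitrary):

kubectl create configmap etcd-smoke-test --from-literal=ok=true
kubectl delete configmap etcd-smoke-test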

Step 3: post-incident hardening. Set up auto-compaction (if not already), set up the weekly defrag cron, set up the monitoring alerts above.

The other thing eating etcd: events

In Kubernetes, the highest-volume writes to etcd are typically Events (every pod transition, every scheduler decision, every probe failure generates an event). Events have a default TTL of 1 hour, but the writes still hit etcd before they expire.
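To gauge how much of your keyspace is events, you can compare key counts (assuming the default Kubernetes /registry storage prefix):

etcdctl get /registry/events --prefix --count-only
etcdctl get /registry --prefix --count-only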

For clusters with high event volume, separating events into a dedicated etcd cluster removes the load:

# Apiserver flags (quoted because servers within one override are
# semicolon-separated; commas separate multiple overrides)
kube-apiserver \
  --etcd-servers-overrides='/events#https://etcd-events-1:2379;https://etcd-events-2:2379' \
  --etcd-servers=https://etcd-1:2379,https://etcd-2:2379

Now /events writes go to the events etcd cluster; everything else goes to the main one. Each cluster manages its own quota independently. Recommended for clusters above ~500 nodes.

Etcd backups also depend on this

When the etcd database grows, your backups grow with it. A 7GB etcd database means 7GB snapshots, every backup interval, possibly stored for weeks. Backup storage cost and restore time both scale with database size.

Defrag keeps backups manageable: after a weekly defrag, snapshots track the live data size instead of the fragmented file size.
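You can see the effect directly in snapshot sizes (the path is illustrative; on etcd v3.5+ the status subcommand is also available as etcdutl snapshot status):

etcdctl snapshot save /backup/etcd-snap.db
etcdctl snapshot status /backup/etcd-snap.db --write-out=table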

Quick reference: the etcd storage health checklist

1. Verify auto-compaction is configured:
   ps aux | grep etcd | grep -E "auto-compaction|etcd-compaction-interval"
   (should show retention setting)

2. Check current size and fragmentation:
   etcdctl endpoint status --write-out=table
   (DBSIZE much larger than DBSIZE_INUSE = needs defrag)

3. Check quota usage:
   etcdctl endpoint status --write-out=json \
     | jq '.[0].Status.dbSizeInUse, .[0].Status.dbSize'

4. Run defrag if fragmentation > 50%:
   etcdctl defrag --endpoints=$ENDPOINT
   (one member at a time in HA setups!)

5. Check for alarms:
   etcdctl alarm list
   (NOSPACE alarm = read-only mode; disarm after freeing space)

6. Set up the schedule:
   - auto-compaction at 8h retention (or apiserver's interval)
   - weekly rolling defrag cron
   - alerts on quota usage and fragmentation

7. For high-event clusters:
   - separate /events to its own etcd cluster

What "healthy etcd" looks like#

A healthy etcd cluster in production has these properties:

  • Database size under 1GB (well below 2GB quota)
  • DBSIZE / DBSIZE_INUSE ratio under 1.5 (fragmentation under 33%)
  • Compaction running automatically every few minutes
  • Defrag running weekly with no manual intervention
  • WAL fsync p99 under 25ms (storage layer is fast enough; see the query after this list)
  • No alarms active
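The fsync check as PromQL, using etcd's standard disk histogram metric:

histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))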

When all of these are true, etcd disappears from your operational concerns. When any of them are off, you are walking toward the wall.

The mental model

Compaction tells etcd which revisions can be deleted. Defrag actually reclaims the space. They are two operations, and you need both.

Without auto-compaction, history grows forever. Without periodic defrag, the file size grows forever even with compaction. The 2GB quota is a hard cliff, not a slope: the cluster works perfectly until it does not.

The fix is operational, not architectural: a cron job, two metrics, three alerts. Set them up once and etcd stops being your problem.


The full etcd operations playbook (encryption, mTLS, backups, restore, disaster recovery, sizing) is the entire etcd Operations course. The cluster-level production patterns (etcd separation, control plane sizing, lifecycle management) are part of the Production Kubernetes Operations course.