etcd Operations Masterclass

How Kubernetes Stores Data in etcd

Every kubectl get pods command ultimately reads etcd. Every kubectl apply ultimately writes etcd. But Kubernetes hides etcd behind the API server so completely that most engineers never see the layer below. This lesson strips that abstraction away.

By the end, you'll know exactly where a ConfigMap lives in etcd, what encoding is on the wire, how to read it with etcdctl, and why this knowledge matters during outages when the API server is unavailable.

KEY CONCEPT

Every Kubernetes object lives at a predictable path in etcd — /registry/<kind>/<namespace>/<name> — serialized as protobuf. Knowing this lets you inspect state directly during incidents, verify backups, migrate clusters, and understand exactly what's using your etcd capacity.


The /registry/ hierarchy

Every Kubernetes object in etcd lives under the /registry/ prefix. The structure is straightforward:

/registry/<resource_type>/<namespace>/<name>

Examples:

/registry/pods/default/my-pod
/registry/configmaps/kube-system/coredns
/registry/secrets/default/db-credentials
/registry/deployments/prod/api-server
/registry/services/default/kubernetes
/registry/namespaces/default

Note that the key uses only the plural resource name: a Deployment lives under /registry/deployments/<namespace>/<name>, with no trace of its API group (apps) in the path.

Cluster-scoped resources (no namespace) skip the namespace component:

/registry/nodes/worker-1
/registry/clusterroles/cluster-admin
/registry/persistentvolumes/pv-100g-1
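The path rules above are mechanical enough to capture in a few lines of shell. A sketch with a made-up helper name (registry_key is not a real tool):

```shell
# registry_key: hypothetical helper that builds an object's etcd key.
# Three arguments for namespaced resources, two for cluster-scoped ones.
registry_key() {
  if [ "$#" -eq 3 ]; then
    printf '/registry/%s/%s/%s\n' "$1" "$2" "$3"   # <resource> <namespace> <name>
  else
    printf '/registry/%s/%s\n' "$1" "$2"           # <resource> <name>
  fi
}

registry_key configmaps kube-system coredns   # /registry/configmaps/kube-system/coredns
registry_key nodes worker-1                   # /registry/nodes/worker-1
```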

Custom resources follow the same pattern:

/registry/cert-manager.io/certificates/default/my-tls-cert
/registry/argoproj.io/applications/argocd/my-app

This structure makes prefix scans the natural query operation: want all pods in namespace default? Scan /registry/pods/default/. Want every Deployment cluster-wide? Scan /registry/deployments/.


Key-space layout at a glance

  • /registry/pods/default/my-pod: namespaced Pod object
  • /registry/configmaps/kube-system/coredns: namespaced ConfigMap
  • /registry/secrets/default/tls-cert: namespaced Secret (opaque bytes)
  • /registry/nodes/worker-1: cluster-scoped (no namespace)
  • /registry/services/endpoints/default/kubernetes: Endpoints object (sub-resource)
  • /registry/leases/kube-system/kube-scheduler: scheduler leader-election lease

All of these paths are discoverable with etcdctl get /registry/ --prefix --keys-only.

Object serialization — protobuf

The value at each /registry/... key isn't JSON. It's protobuf — a compact binary serialization format.

Why protobuf over JSON:

  • Smaller: roughly half the size of equivalent JSON.
  • Faster to parse: binary fields, no text scanning.
  • Strongly typed: every field has a schema, no ambiguity.

The trade-off: you can't just cat an etcd value and read it. You need a protobuf deserializer that understands Kubernetes' schemas.

The k8s\x00 magic prefix

Every Kubernetes protobuf value starts with a magic prefix:

k8s\x00<protobuf-bytes>

The first four bytes are literally k, 8, s, \x00. This tells the API server "this is a Kubernetes protobuf, here's the payload." It also distinguishes Kubernetes-encoded values from other formats the API server might encounter (e.g., if encryption-at-rest is enabled, it's prefixed differently).
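A minimal way to see those bytes, simulating the value with printf (on a real control-plane node you'd pipe etcdctl get <key> --print-value-only into the same check):

```shell
# Simulate a Kubernetes-encoded value; \0 is the NUL byte (\x00)
printf 'k8s\0<protobuf-bytes>' > /tmp/value.bin

# Dump the first four bytes as hex
head -c 4 /tmp/value.bin | od -An -tx1
#  6b 38 73 00    <- ASCII 'k' '8' 's' then NUL
```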

PRO TIP

If you ever see data in etcd that starts with k8s\x00, you're looking at Kubernetes-encoded protobuf. Anything else (like {...} JSON, or <encrypted>...) means something else is happening — often legacy JSON objects or encryption at rest.


Using etcdctl to read K8s data

On a control-plane node, you have access to the etcd endpoints and client certificates. Here's how to list resources directly:

# Set up environment (adjust paths to your install)
export ETCDCTL_API=3
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379

Now you can query:

# List all pod keys (not values, just keys)
etcdctl get --prefix --keys-only /registry/pods/

# List all namespaces
etcdctl get --prefix --keys-only /registry/namespaces/

# Count ConfigMaps in kube-system
etcdctl get --prefix --keys-only /registry/configmaps/kube-system/ | grep -c /registry

# Get a specific pod's raw protobuf value
etcdctl get /registry/pods/default/my-pod

The last command returns binary output. To make it readable, pipe through a decoder.

Decoding protobuf values

The auger tool (maintained by the Kubernetes community) converts the protobuf back to YAML:

# Install auger
go install github.com/jpbetz/auger@latest

# Decode a pod
etcdctl get /registry/pods/default/my-pod --print-value-only \
  | auger decode

Output: the full Pod spec as YAML, identical to kubectl get pod my-pod -o yaml (plus some internal fields).

A quicker sanity check with standard shell tools:

etcdctl get /registry/pods/default/my-pod --print-value-only \
  | head -c 4   # Shows "k8s\0"

So the magic prefix is the first 4 bytes. The rest is protobuf you could decode yourself against the k8s.io/apimachinery definitions, though that's usually more trouble than using auger.


Storage format for special resources

A few resources have quirks worth knowing about.

Events

Events live at:

/registry/events/<namespace>/<event-name>

Events have a short TTL (~1 hour default) via etcd leases — they auto-delete. This is why kubectl get events only shows recent events.

Events are high-volume and can dominate etcd write load. A misbehaving controller generating events at 10/sec will burn through capacity.
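The arithmetic behind that capacity claim is simple: with a fixed TTL, the steady-state number of live event keys is the event rate times the TTL.

```shell
# Steady-state live event keys = event rate x TTL (illustrative numbers)
rate_per_sec=10
ttl_sec=3600   # the API server's --event-ttl defaults to 1h
echo $(( rate_per_sec * ttl_sec ))   # 36000 live event keys
```

And every one of those keys is a write (plus a later lease-expiry delete), which is where the load comes from.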

Leases

Lease objects themselves are stored at /registry/leases/<namespace>/<name>. Confusingly, this is different from etcd's native lease primitive (the one covered in lesson 1.2).

  • etcd lease (primitive): a TTL on a key, internal to etcd.
  • Kubernetes Lease object: a user-facing resource used for coordination and heartbeats.

The Kubernetes Lease object does not use etcd's lease primitive directly. Instead, it's a regular etcd key with a timestamp field, and kubelets/controllers update the timestamp to "renew." The node controller checks timestamps to decide if a node is alive.
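The liveness decision can be sketched in shell. The 40-second grace period mirrors the controller manager's default --node-monitor-grace-period; GNU date is assumed for the relative-time syntax:

```shell
# A Lease is an ordinary key whose renewTime the kubelet keeps bumping;
# the node controller compares that timestamp against a grace period.
renew_time=$(date -u -d '50 seconds ago' +%s)   # last heartbeat, simulated
now=$(date -u +%s)
grace=40                                        # node-monitor-grace-period default

if [ $(( now - renew_time )) -gt "$grace" ]; then
  echo "node considered unhealthy"   # 50s since renewal > 40s grace
else
  echo "node healthy"
fi
```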

Secrets

Secrets are just keys under /registry/secrets/. By default, the value is plain protobuf containing the secret data as raw bytes. (The base64 you see in kubectl output is a JSON/YAML encoding detail, not encryption, and isn't present in the stored protobuf.)

This is the #1 reason to enable encryption at rest in Kubernetes: without it, anyone with read access to the etcd database file (e.g., a stolen backup) can read every Secret in the cluster.

With encryption at rest enabled, the value format becomes:

k8s:enc:<provider>:<version>:<encrypted-payload>
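A quick way to tell the two formats apart, with the value simulated here (on a real node it would come from etcdctl get <key> --print-value-only):

```shell
# Simulated encrypted Secret value (aescbc provider shown)
printf 'k8s:enc:aescbc:v1:key1:<ciphertext>' > /tmp/secret-value.bin

case "$(head -c 8 /tmp/secret-value.bin)" in
  'k8s:enc:') echo "encrypted at rest" ;;
  *)          echo "plain protobuf (NOT encrypted)" ;;
esac
# encrypted at rest
```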

More on this in lesson 6.3.

ConfigMaps

Same as Secrets but without the implicit expectation of secrecy. Plain protobuf, no encryption unless you explicitly enable it for ConfigMaps (less common).

Custom resources (CRDs)

CRDs themselves live at:

/registry/apiextensions.k8s.io/customresourcedefinitions/<name>

Instances of custom resources live under the CRD's group:

/registry/<group>/<resource>/<namespace>/<name>

# e.g. for cert-manager's Certificate CRD (group: cert-manager.io):
/registry/cert-manager.io/certificates/default/my-tls-cert

One implication: every CRD you create adds to the etcd key space. Teams that spray CRDs across their cluster (Argo, Flux, cert-manager, external-dns, ingress-nginx admission, OPA Gatekeeper, etc.) can have tens of thousands of custom resources, each a key in etcd.
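One way to see that growth is to group the key listing by API group; any first path segment containing a dot is a group. The input is simulated below; on a real cluster you'd feed etcdctl get --prefix --keys-only /registry/ into the same pipeline:

```shell
printf '%s\n' \
  /registry/pods/default/web \
  /registry/cert-manager.io/certificates/default/tls-a \
  /registry/cert-manager.io/certificates/default/tls-b \
  /registry/argoproj.io/applications/argocd/my-app \
| awk -F/ '$3 ~ /\./ {print $3}' \
| sort | uniq -c | sort -rn
#       2 cert-manager.io
#       1 argoproj.io
```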


A worked example: finding a ConfigMap in etcd

Say you want to verify that the coredns ConfigMap in kube-system is actually what kubectl says it is. Full walkthrough:

# 1. Check with kubectl
kubectl get configmap -n kube-system coredns -o yaml
# ... shows the ConfigMap contents ...

# 2. Find the etcd key
etcdctl get --keys-only --prefix /registry/configmaps/kube-system/coredns
# /registry/configmaps/kube-system/coredns

# 3. Fetch the raw value
etcdctl get /registry/configmaps/kube-system/coredns --print-value-only | auger decode
# ... the same ConfigMap contents as kubectl showed ...

# 4. Check metadata
etcdctl get /registry/configmaps/kube-system/coredns -w json | jq '.kvs[0] | {create_revision, mod_revision, version}'
# {
#   "create_revision": 1023,
#   "mod_revision": 1023,
#   "version": 1
# }

The version counts writes to this key: it starts at 1 when the key is created and increments on every update, so 1 here means the ConfigMap has never been modified since creation.

If an object in kubectl doesn't match what's in etcd, you have a serious bug (admission webhook misbehaving, mutating controller gone wrong, corrupted data). Being able to check this directly is the escape hatch.


How much space is each resource using?

Useful admin query: which resources are hogging etcd?

# Get key counts by resource type
for kind in pods configmaps secrets services deployments daemonsets events; do
  count=$(etcdctl get --prefix --keys-only /registry/$kind/ | grep -c /registry)
  echo "$kind: $count"
done

Output:

pods: 512
configmaps: 147
secrets: 89
services: 42
deployments: 38
daemonsets: 8
events: 1203

If events dominate and the DB is big, your retention is off (or a controller is event-spamming).

Approximate value-size query

A more advanced query uses etcd-analyze or custom scripts. For a quick estimate:

etcdctl get --prefix /registry/pods/ | wc -c
# Total bytes of all pod keys + values

Across a cluster, this helps you understand where the bytes are going.


Watch with etcdctl

You can watch etcd keys directly, just like Kubernetes components do internally:

# Watch every change to pods in the default namespace
etcdctl watch --prefix /registry/pods/default/

Modify a pod in another terminal (kubectl scale deployment/foo --replicas=3) and you'll see the PUT events stream by in real time. This is literally what the scheduler and kubelets see.

Fascinating for learning; invaluable for debugging "is this pod actually getting created in etcd?" type issues.


The cluster-resource inventory

A useful one-liner to see what types of resources exist in your cluster:

etcdctl get --prefix --keys-only / | awk -F/ 'NF {print $3}' | sort -u

Output for a typical cluster:

apiregistration.k8s.io
apps
batch
certificates.k8s.io
configmaps
controllerrevisions
coordination.k8s.io
daemonsets
deployments
endpoints
events
leases
namespaces
nodes
persistentvolumes
pods
replicasets
secrets
serviceaccounts
services
statefulsets
... plus any CRD groups you've installed

Each of those represents a type of object stored under /registry/<type>/....


What the API server does with all this

The API server is essentially a typed, validating, caching proxy over etcd's key-value interface. Its layers:

  1. Admission (validation, mutation).
  2. Serialization (JSON/YAML ↔ protobuf).
  3. Storage (etcd get/put/watch/transaction).
  4. Cache (an in-memory watchCache for hot resources).

When kubectl issues GET /api/v1/namespaces/default/pods/my-pod, the API server:

  1. Authenticates the request.
  2. Checks RBAC authorization.
  3. Translates the REST URL into the etcd key /registry/pods/default/my-pod.
  4. Performs the equivalent of etcdctl get.
  5. Decodes the protobuf.
  6. Converts it to JSON/YAML for the response.

(Admission, the validation and mutation layer, runs on writes, not on reads.)

This is why the API server is CPU-hungry — every request is schema validation, serialization, and etcd roundtrip.
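The URL-to-key translation in that pipeline is pure string manipulation. A sketch covering only core-group namespaced resources (url_to_key is a made-up name; group and aggregated APIs map differently):

```shell
# Map /api/v1/namespaces/<ns>/<resource>/<name> to its etcd key
url_to_key() {
  echo "$1" | awk -F/ '{ printf "/registry/%s/%s/%s\n", $6, $5, $7 }'
}

url_to_key /api/v1/namespaces/default/pods/my-pod
# /registry/pods/default/my-pod
```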

KEY CONCEPT

The API server can cache reads (for popular resources, it serves from the watchCache). Writes always go through to etcd. Watches deliver events from etcd to the API server to the client in near real time.


Why this knowledge matters during incidents

Why go to the trouble of learning etcd's internals? Because during a real incident, you may need to:

1. Verify backup contents

Before restoring, open the snapshot and confirm the data you think is there actually is:

etcdctl --write-out=table snapshot status snapshot.db
# hash / revision / total keys

etcdctl get --prefix --keys-only / | grep -c '^/'
# how many keys total? (--keys-only emits a blank line after each key,
# so a plain wc -l would double-count)

If a backup claims to have your production state but has 1/10th the keys you expect, something's wrong.

2. Debug API server problems

If kubectl is throwing weird errors, you can check if the data even exists in etcd:

etcdctl get /registry/pods/default/problematic-pod

If it's there but kubectl can't read it, the API server has a problem (RBAC, webhook, cache). If it's not there, something deleted it.

3. Clean up corrupted resources

Rare but real: an API server bug or malfunctioning webhook creates a Kubernetes object that can't be deleted normally (the API server errors on every attempt). With etcd access, you can surgically remove it:

etcdctl del /registry/stuck-resources/default/bad-object

Be careful — any direct etcd write bypasses validation. Only do this when you understand exactly what you're removing and why.

4. Understand capacity

When etcd size is approaching quota, the questions "which resources are eating the space" and "are old revisions accumulating" are answered directly from etcd.


Quiz

KNOWLEDGE CHECK

You run 'etcdctl get /registry/secrets/default/my-secret --print-value-only' on a cluster that has encryption at rest enabled. What does the output look like?


What to take away

  • Every Kubernetes object lives at a predictable path: /registry/<resource>/<namespace>/<name>.
  • Cluster-scoped resources skip the namespace segment.
  • Values are protobuf with a k8s\0 magic prefix; decode with auger or similar.
  • Encryption at rest changes the value format to k8s:enc:... — essential for Secrets in production.
  • etcdctl with the right certs lets you read, write, watch, and count directly.
  • Prefix scans are the natural query: --prefix /registry/pods/default/ returns all pods in a namespace.
  • Events use auto-expiring leases to keep etcd from filling up with them; still a common capacity offender.
  • CRDs live at /registry/<group>/<resource>/... and add to the key-space.
  • During incidents, direct etcd access is the escape hatch for backup verification, API server bypass, and surgical fixes.

Next module: sizing etcd for your actual cluster — the 8GB limit, when to raise it, and the disk requirements that determine whether etcd will stay healthy.