Kubernetes Security for DevOps Engineers

etcd: Where All State Lives

An attacker gained access to your etcd cluster. They now have every secret in the cluster in plaintext. How did this happen?

If the API server is the front door of a cluster, etcd is the safe in the back room. Every Secret you have ever stored, every ServiceAccount token, every ConfigMap, every RBAC binding lives in etcd. The API server reads from and writes to etcd; everything else is decoration around that storage layer.

That makes etcd the single most valuable target in any Kubernetes cluster. An attacker with read access to etcd has every credential and secret without needing any API call, RBAC permission, or admission control bypass. A defender who fails to harden etcd has built every other security control on a foundation that can be lifted out from underneath.

This lesson covers the etcd security model: what it stores, how it stores it, and the hardening that turns it from a single point of failure into a defensible component.

Attack path / threat explanation

Three distinct ways an attacker reaches etcd:

1. Direct network access to the etcd port. Default port 2379 (client) and 2380 (peer). If the attacker is on the same network as the etcd nodes (compromised node, lateral movement, misconfigured network policy at the infrastructure layer), they can connect directly. If etcd does not require client mTLS, they can read everything immediately.

2. Compromise of an etcd node. The etcd data directory (/var/lib/etcd by default) lives on disk on each etcd member. An attacker with root on an etcd node has the entire database in raw form. If encryption at rest is not configured, the secrets are in plaintext.

3. Backup theft. Production clusters back up etcd regularly. The backup is a snapshot of the database, sitting in object storage (S3, GCS, etc.). If the bucket has weak access controls, an attacker downloads the backup and gets everything that was in the cluster at that point in time. This is often the easiest path: backups tend to be easier to reach than running etcd nodes.
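The first path can be checked from the attacker's side. The sketch below is a minimal reachability probe, assuming Python 3 on a host that should not be able to reach etcd (a worker node, for example). The function names are illustrative, not part of any etcd tooling. An open port only shows that the network path exists; whether etcd then demands a client certificate is a separate check.

```python
import socket

# Default etcd ports as listed above: 2379 (client) and 2380 (peer).
ETCD_PORTS = {"client": 2379, "peer": 2380}


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, filtered, timed out, or unroutable: treat as not reachable.
        return False


def probe_etcd(host: str) -> dict:
    """Probe both default etcd ports on one host."""
    return {name: is_port_open(host, port) for name, port in ETCD_PORTS.items()}
```

Calling probe_etcd("10.0.0.10") (a hypothetical member address) from a worker node should report both ports as unreachable in a properly isolated cluster.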

The data an attacker gets includes:

  • Every Secret in the cluster. Database passwords, API keys, OAuth tokens, TLS private keys, encrypted-but-key-also-in-cluster keys.
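  • Every ServiceAccount token stored as a Secret. Legacy long-lived tokens are directly usable. Bound (projected) tokens are issued on request and are not persisted in etcd, but the ServiceAccount objects they are bound to are.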
  • Every RBAC binding. Maps from identity to permission across the cluster.
  • Every ConfigMap. Often contains config secrets that should have been Secrets, plus all application configuration.

The blast radius of an etcd compromise is the entire cluster's worth of credentials. Recovery requires rotating every credential, every certificate, and every token.
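To see why "plaintext" is the right word, recall that the data field of a Secret manifest is only base64-encoded, which is an encoding and not encryption. The sketch below decodes a made-up Secret (the name db-credentials and its value are hypothetical). Without encryption at rest, the value stored in etcd is just as readable.

```python
import base64
import json

# A made-up Secret manifest for illustration only.
SECRET_JSON = """
{
  "apiVersion": "v1",
  "kind": "Secret",
  "metadata": {"name": "db-credentials", "namespace": "default"},
  "data": {"password": "czNjcjN0"}
}
"""


def decode_secret_data(manifest: str) -> dict:
    """Base64-decode every value under .data of a Secret manifest."""
    secret = json.loads(manifest)
    return {
        key: base64.b64decode(value).decode("utf-8", errors="replace")
        for key, value in secret.get("data", {}).items()
    }


print(decode_secret_data(SECRET_JSON))
```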

KEY CONCEPT

etcd is not just where state lives; it is where every secret, credential, and token lives. Compromise of etcd is functionally equivalent to compromise of every workload in the cluster, because every workload's credentials are in etcd. Plan etcd security as if it were the secret you most need to protect, because it is.

How it works under the hood

The etcd security architecture in a production cluster:

[Diagram: etcd security, where each control attaches. Components shown: API server (the only legitimate client), client mTLS (port 2379), peer mTLS (port 2380), encryption at rest (EncryptionConfiguration), backup storage, network isolation.]

The four hardening layers, each independent:

1. Network isolation. etcd should not be reachable from anywhere except the API server, plus the other etcd members on the peer port. Put etcd on a dedicated subnet; use cloud security groups (AWS), firewall rules (GCP), or NSGs (Azure) to restrict client traffic (2379) to the API server CIDR and peer traffic (2380) to the etcd members themselves.

2. Mutual TLS for both client and peer. Both ports use TLS, and both require client cert validation. etcd accepts a separate CA for each (--trusted-ca-file for clients, --peer-trusted-ca-file for peers), and hardened production setups often use two.

3. Encryption at rest. Configured via EncryptionConfiguration on the API server. The API server encrypts Secret data with a configured key (AES-CBC, AES-GCM, or KMS-backed) before sending to etcd. The encryption is symmetric; the key is held by the API server. If the key is in a Kubernetes Secret in the same cluster, you have a chicken-and-egg problem; use a KMS provider (AWS KMS, GCP KMS, Azure Key Vault, HashiCorp Vault) instead.

4. Backup security. Etcd backups are full snapshots. They must be encrypted at rest in the backup destination, access-controlled with IAM (not public buckets), and retention-managed. A compromised backup gives the same data as a compromised live etcd.
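The network isolation layer can be audited offline against an export of your firewall rules. The sketch below assumes a simplified rule shape (source CIDR plus a TCP port range); the IngressRule type and function name are hypothetical, and real security group exports carry more fields. The allowed_cidrs argument would hold the API server CIDR and, for peer traffic, the etcd members' own subnet.

```python
import ipaddress
from dataclasses import dataclass

ETCD_PORTS = (2379, 2380)  # client and peer ports


@dataclass(frozen=True)
class IngressRule:
    """Simplified rule: allow TCP from cidr to ports port_from..port_to."""
    cidr: str
    port_from: int
    port_to: int


def overly_broad_rules(rules, allowed_cidrs):
    """Return rules exposing an etcd port to sources outside allowed_cidrs."""
    allowed = [ipaddress.ip_network(c) for c in allowed_cidrs]
    findings = []
    for rule in rules:
        covers_etcd = any(rule.port_from <= p <= rule.port_to for p in ETCD_PORTS)
        source = ipaddress.ip_network(rule.cidr)
        # subnet_of raises on mixed IPv4/IPv6, so compare versions first.
        inside = any(
            source.version == net.version and source.subnet_of(net)
            for net in allowed
        )
        if covers_etcd and not inside:
            findings.append(rule)
    return findings
```

A rule such as 0.0.0.0/0 to port 2379 is reported, while the same source to port 443 is ignored because it does not touch etcd.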

The etcd database structure that matters operationally:

  • WAL (write-ahead log): every Raft entry. Append-only, fsynced.
  • boltdb backend: the materialized state. This is what readers see.
  • Snapshot: periodic point-in-time captures of the boltdb file. Used for backups and for new members joining the cluster.

Each of these files is sensitive. Secret data lives in all three (in encrypted form if encryption at rest is configured; in plaintext if not).
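Because all three live under the data directory (the WAL under member/wal and the boltdb file under member/snap), file permissions on that directory are a cheap first check. The CIS Kubernetes Benchmark recommends mode 700 or stricter for it. The sketch below, with a hypothetical function name, flags any path that grants access to group or other.

```python
import os
import stat


def loose_paths(paths):
    """Return (path, mode) pairs for paths that grant group/other access."""
    loose = []
    for path in paths:
        mode = stat.S_IMODE(os.stat(path).st_mode)
        if mode & 0o077:  # any group or other permission bit is set
            loose.append((path, oct(mode)))
    return loose
```

On an etcd node you might call loose_paths(["/var/lib/etcd"]) and expect an empty list. This says nothing about who has root on the node, which remains the larger risk.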

Defense architecture

A complete etcd security checklist for a production cluster:

1. Encryption at rest enabled with a real KMS provider. A key file on disk does not count; use KMS-backed encryption with key rotation.

2. mTLS for client and peer connections. Verify with etcdctl --cacert ... --cert ... --key ... endpoint health. If this works without certs, your etcd is unauthenticated.

3. Network access restricted at the infrastructure layer. Cloud security groups, firewall rules, or VLAN isolation. The etcd port should be unreachable from anywhere except the API server.

4. Dedicated nodes for etcd at scale. For clusters above 100 nodes, etcd on its own VMs separate from the rest of the control plane. Reduces blast radius of node compromise.

5. Backups encrypted, access-controlled, retention-managed. S3 with bucket encryption, IAM-restricted, lifecycle policies for retention. Backups treated as the secret material they are.

6. Audit logging at the API server level captures Secret access. etcd itself does not log read access in detail; the API server's audit log is the place to see "user X read Secret Y."

7. Periodic key rotation. The encryption-at-rest key should rotate on a schedule (typically annually). The rotation procedure rewrites every encrypted resource in etcd; plan the operational window.
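For item 6, the audit log only helps if someone reads it. The sketch below filters Secret reads out of an audit log, assuming the log backend writes one JSON event per line and that your audit policy records Secret requests at least at the Metadata level. The function name is illustrative.

```python
import json

READ_VERBS = {"get", "list", "watch"}


def secret_reads(audit_lines):
    """Yield (username, verb, namespace, name) for completed Secret reads."""
    for line in audit_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        ref = event.get("objectRef") or {}
        if ref.get("resource") != "secrets":
            continue
        if event.get("verb") not in READ_VERBS:
            continue
        # Each request is logged at several stages; count it once.
        if event.get("stage") != "ResponseComplete":
            continue
        user = (event.get("user") or {}).get("username", "?")
        yield (user, event["verb"], ref.get("namespace", ""), ref.get("name", ""))
```

Feeding it the lines of an audit log file yields tuples such as ("alice", "get", "prod", "db-credentials"), which is the "user X read Secret Y" record that etcd itself cannot give you.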

Configuration examples

A minimal EncryptionConfiguration for the API server:

apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - kms:
          apiVersion: v2  # KMS v1 (the default when omitted) is deprecated
          name: my-kms-provider
          endpoint: unix:///var/run/kmsplugin/socket.sock
          timeout: 3s
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded-32-byte-key>
      - identity: {}  # fallback; reads existing unencrypted data

Provider order matters: encryption uses the first provider; decryption tries each in order. The identity provider at the end allows reading data that was previously stored unencrypted (during migration).
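Since ordering mistakes are silent, a small lint over the provider list can catch them before the API server starts. The sketch below takes the providers as a Python list mirroring the YAML above; lint_providers and its warning texts are hypothetical, not a Kubernetes tool.

```python
# Providers that keep the encryption key inline in the config file.
LOCAL_KEY_PROVIDERS = {"aescbc", "aesgcm", "secretbox"}


def lint_providers(providers):
    """Return warnings for an ordered provider list.

    Each item is a one-key dict such as {"kms": {...}}, as in the YAML.
    """
    names = [next(iter(p)) for p in providers]
    if not names:
        return ["no providers configured"]
    warnings = []
    if names[0] == "identity":
        # The first provider handles writes, so nothing gets encrypted.
        warnings.append("identity is first: new writes are stored unencrypted")
    elif "identity" in names:
        warnings.append("identity fallback present: unencrypted data is still readable")
    for name in names:
        if name in LOCAL_KEY_PROVIDERS:
            warnings.append(f"{name}: key material sits in the config file on disk")
    return warnings


providers = [
    {"kms": {"name": "my-kms-provider"}},
    {"aescbc": {"keys": [{"name": "key1"}]}},
    {"identity": {}},
]
for warning in lint_providers(providers):
    print(warning)
```

For the example configuration this prints two warnings, one for the identity fallback and one for the inline aescbc key. Both are acceptable during a migration and worth removing afterwards.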

Reference the file in the API server flags:

kube-apiserver \
  --encryption-provider-config=/etc/kubernetes/encryption-config.yaml \
  ...

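After enabling the configuration (and restarting the API server so it loads the file), rewrite all existing Secrets so they are re-encrypted with the first provider:

# Read every Secret in every namespace and write it back unchanged;
# each write passes through the API server and is encrypted on the way to etcd
kubectl get secrets --all-namespaces -o json \
  | kubectl replace -f -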
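One way to confirm the rewrite took effect is to look at the raw value in etcd (for example with etcdctl get /registry/secrets/<namespace>/<name> --print-value-only). Values written through an encryption provider start with a k8s:enc:<provider>:<version>:<name>: prefix, while unencrypted protobuf values start with k8s followed by a zero byte. The sketch below classifies a raw value on that basis; the function name is hypothetical.

```python
def storage_form(raw: bytes) -> str:
    """Classify a raw etcd value by its leading bytes."""
    if raw.startswith(b"k8s:enc:"):
        # Expected layout: k8s : enc : provider : version : name : ciphertext
        parts = raw.split(b":", 5)
        if len(parts) == 6:
            provider, version, name = (
                p.decode("ascii", "replace") for p in parts[2:5]
            )
            return f"encrypted ({provider} {version}, name {name})"
        return "encrypted (unrecognized prefix)"
    if raw.startswith(b"k8s\x00"):
        return "plaintext (protobuf)"
    if raw.lstrip().startswith(b"{"):
        return "plaintext (JSON)"
    return "unknown"


print(storage_form(b"k8s:enc:aescbc:v1:key1:\x8a\x01"))
print(storage_form(b"k8s\x00\n\x0c"))
```

Any Secret that still reports as plaintext after the rewrite was either missed or was written while identity was the first provider.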

For etcd mTLS verification:

# From an etcd node: trust the etcd CA but present no client cert.
# This should fail if client cert authentication is enforced.
etcdctl --endpoints=https://localhost:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  endpoint health

# With a client cert (kubeadm paths shown; adjust for your install)
etcdctl --endpoints=https://localhost:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint health

For backup security, a sample S3 bucket policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/etcd-backup-role"
      },
      "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-etcd-backups",
        "arn:aws:s3:::my-etcd-backups/*"
      ]
    },
    {
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::my-etcd-backups",
        "arn:aws:s3:::my-etcd-backups/*"
      ],
      "Condition": {
        "StringNotEquals": {
          "aws:PrincipalArn": "arn:aws:iam::123456789012:role/etcd-backup-role"
        }
      }
    }
  ]
}

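This is an explicit-deny pattern: the second statement denies every principal except etcd-backup-role, and an explicit deny overrides any allow granted elsewhere. It also locks out ordinary administrators, so decide in advance how the policy itself will be maintained.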
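The opposite failure, a bucket policy that allows everyone, can be caught with a rough heuristic. The sketch below scans a policy document for Allow statements with a wildcard principal and no Condition. The function names are hypothetical, and this is no substitute for your cloud provider's own policy analysis tooling, which also accounts for ACLs and account-level settings.

```python
import json


def _is_wildcard(principal) -> bool:
    """True for "*" or {"AWS": "*"} style principals."""
    if principal == "*":
        return True
    if isinstance(principal, dict):
        aws = principal.get("AWS")
        values = aws if isinstance(aws, list) else [aws]
        return "*" in values
    return False


def public_allow_statements(policy_json: str):
    """Return Allow statements open to any principal with no Condition."""
    statements = json.loads(policy_json).get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    return [
        s for s in statements
        if s.get("Effect") == "Allow"
        and _is_wildcard(s.get("Principal"))
        and not s.get("Condition")
    ]
```

The sample policy above produces no findings, because its only wildcard principal sits in a Deny statement.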

Common misconfigurations

  • No encryption at rest. Default in many distros. Secrets stored in plaintext in etcd. Catastrophic if etcd is compromised.
  • Encryption key on disk on the same machine as etcd. Defeats the purpose; an attacker with file access has both the key and the data.
  • etcd reachable from worker nodes or pods. Network controls missing or wrong. Lateral movement from a compromised pod to etcd is then trivial.
  • No mTLS between etcd members. Allows a malicious member to join the cluster.
  • Backups in a public bucket. Embarrassingly common. Audit object storage permissions for any bucket containing etcd data.
  • No backup encryption. An unencrypted snapshot hands the full database to anyone who can read the object. Encrypt backups with a key that is managed separately from the bucket's access controls.
  • No periodic key rotation. A key in use for years is a key that has likely been seen by enough people to assume compromise.
  • Skipping the audit log on Secret access. etcd does not log who reads what; API server audit does. Without it, you cannot tell if Secrets were exfiltrated.

INTERVIEW QUESTION

If someone has read access to etcd, what can they get? How do you prevent this?