
Your Kubernetes Cluster Just Died at 2 AM: The Certificate Nobody Was Watching

Kubernetes certificates expire silently. No warning, no alert, no graceful degradation, just a dead cluster. Here is how to fix it in five minutes and how to make sure it never happens again.

By Sharon Sahadevan · 13 min read

Your phone buzzes at 2 AM.

"Production is down. kubectl is not working. Nothing is deploying."

You open your laptop, half asleep, and run kubectl get nodes. Instead of a node list, you see this:

Unable to connect to the server: x509: certificate has expired or is not yet valid

Your cluster is unreachable. Not a pod issue. Not a network issue. The Kubernetes API server's TLS certificate expired, and the entire control plane is now rejecting every request.

No warning. No alert. No graceful degradation. Just a dead cluster.

This is one of the most common, and most preventable, Kubernetes outages. It happens to experienced teams all the time.

A note on scope. This guide is for self-managed clusters built with kubeadm. If you run EKS, GKE, or AKS, the cloud provider handles control plane certificates for you. The kubelet on your worker nodes is still your problem (more on that below), but you will not get the 2 AM API-server-cert-expired call.
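Even on a managed control plane, it is worth spot-checking the kubelet's client certificate on a worker node. A quick check, assuming the kubelet's default PKI location:

# On a worker node: if this cert expires and cannot rotate,
# the node eventually goes NotReady
sudo openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates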

Why this happens

Kubernetes runs on TLS certificates. Every component authenticates to every other component using certificates signed by the cluster's Certificate Authority. When you set up a cluster with kubeadm, it generates all of these automatically.

Here is what most engineers do not realize: kubeadm-issued certificates expire after 1 year by default.

The cluster CA itself is valid for 10 years, but the certificates it signs (the ones the API server, controller manager, scheduler, and kubelet use to talk to each other) expire in 365 days.

There is no built-in alert. There is no warning in kubectl. The cluster runs perfectly until the second the certificate expires, and then everything stops.

The certificates you need to know

A kubeadm cluster has several certificates. Understanding which does what is the difference between a 5-minute fix and a 2-hour scramble.

Control plane certificates (in /etc/kubernetes/pki/):

  • apiserver.crt: The API server's serving certificate. When this expires, kubectl cannot connect, nothing can talk to the API server, and the cluster is effectively dead.
  • apiserver-kubelet-client.crt: Used by the API server to authenticate to kubelets. When this expires, kubectl logs and kubectl exec stop working.
  • apiserver-etcd-client.crt: Used by the API server to talk to etcd. When this expires, the API server cannot read or write cluster state.
  • front-proxy-client.crt: Used for aggregated API servers like metrics-server. When this expires, kubectl top stops working.

etcd certificates (in /etc/kubernetes/pki/etcd/):

  • server.crt: etcd's serving certificate.
  • peer.crt: Used for etcd node-to-node communication in HA setups.
  • healthcheck-client.crt: Used by health check probes.

Kubeconfig files (in /etc/kubernetes/):

  • admin.conf: The kubeconfig you copy to ~/.kube/config. Contains an embedded client certificate.
  • controller-manager.conf: Used by kube-controller-manager.
  • scheduler.conf: Used by kube-scheduler.
  • kubelet.conf: Used by the kubelet on the control plane node.
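If you want to confirm what a given certificate actually covers, openssl can print its subject and SANs. A sketch using the API server's serving cert (the -ext flag needs OpenSSL 1.1.1+):

# Which identities and hostnames/IPs does this cert cover?
sudo openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout \
  -subject -ext subjectAltName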

The one that auto-renews: The kubelet's client certificate rotates automatically by default (the rotateCertificates setting, GA since 1.19). This is why your worker nodes keep running even when the control plane certificates expire: they handle their own renewal.

The one that does NOT auto-renew, and burns people: The kubelet's serving certificate. By default the kubelet generates a self-signed serving cert at install and never rotates it. You only get auto-rotation if you explicitly set serverTLSBootstrap: true in the kubelet config and have a CSR approver running. When the kubelet serving cert expires, kubectl logs and kubectl exec start failing on that node, even if everything else looks fine.

Everything else does not auto-renew. You have to do it yourself.
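If you want serving-cert rotation too, the opt-in looks roughly like this. The field comes from the KubeletConfiguration API; note that Kubernetes never auto-approves kubelet serving CSRs, so you need an approver component or manual approval:

# In the kubelet config (/var/lib/kubelet/config.yaml on kubeadm nodes):
serverTLSBootstrap: true

# After restarting the kubelet, it files a CSR for its serving cert.
# Approve it manually, or run an approver so this happens automatically:
kubectl get csr
kubectl certificate approve <csr-name>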

Step 1: Diagnose what actually expired

Before you fix anything, figure out which certificates expired. SSH into a control plane node and run:

sudo kubeadm certs check-expiration

You will see something like:

CERTIFICATE                EXPIRES                  RESIDUAL TIME
admin.conf                 May 02, 2026 14:30 UTC   <invalid>
apiserver                  May 02, 2026 14:30 UTC   <invalid>
apiserver-etcd-client      May 02, 2026 14:30 UTC   <invalid>
apiserver-kubelet-client   May 02, 2026 14:30 UTC   <invalid>
controller-manager.conf    May 02, 2026 14:30 UTC   <invalid>
front-proxy-client         May 02, 2026 14:30 UTC   <invalid>
scheduler.conf             May 02, 2026 14:30 UTC   <invalid>
etcd-healthcheck-client    May 02, 2026 14:30 UTC   <invalid>
etcd-peer                  May 02, 2026 14:30 UTC   <invalid>
etcd-server                May 02, 2026 14:30 UTC   <invalid>

Also check the CA itself. If the CA expired, you cannot use kubeadm certs renew at all:

sudo openssl x509 -in /etc/kubernetes/pki/ca.crt -noout -dates

If kubeadm is not available, check files directly:

# Check API server certificate
sudo openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates

# Check all certs in the PKI directory
for cert in /etc/kubernetes/pki/*.crt /etc/kubernetes/pki/etcd/*.crt; do
  echo "=== $cert ==="
  sudo openssl x509 -in "$cert" -noout -dates
done

# Check embedded certs in kubeconfig files
sudo grep client-certificate-data /etc/kubernetes/admin.conf \
  | awk '{print $2}' | base64 -d | openssl x509 -noout -dates
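And if kubectl logs or kubectl exec fails against one specific node, check that node's kubelet serving certificate too (the default self-signed path is shown; adjust if you changed the kubelet PKI directory):

# On the affected node: the kubelet serving cert, which does not auto-rotate
sudo openssl x509 -in /var/lib/kubelet/pki/kubelet.crt -noout -dates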

Step 2: Renew

Once you have confirmed which certificates expired, renew them all:

sudo kubeadm certs renew all

This regenerates every kubeadm-managed certificate using the existing cluster CA. Output:

[renew] Reading configuration from the cluster...
[renew] Creating a new certificate for serving the Kubernetes API server
[renew] Creating a new certificate for the API server to connect to kubelet
[renew] Creating a new certificate for the API server to connect to etcd
[renew] Creating a new certificate for the front proxy client
[renew] Creating a new certificate for etcd serving
[renew] Creating a new certificate for etcd peer connections
[renew] Creating a new certificate for the etcd healthcheck client
[renew] Creating a new certificate for the admin kubeconfig
[renew] Creating a new certificate for the controller-manager kubeconfig
[renew] Creating a new certificate for the scheduler kubeconfig

Done renewing certificates. You must restart the kube-apiserver,
kube-controller-manager, kube-scheduler and etcd to use the new certificates.

Step 3: Restart (the step everyone forgets)

Renewing certificates does nothing until you restart the components that use them. The API server, controller manager, scheduler, and etcd are still running with the old expired certificates loaded in memory.

If your control plane runs as static pods (the kubeadm default):

# Move all static pod manifests out of the watched directory
sudo mkdir -p /tmp/k8s-manifests-backup
sudo mv /etc/kubernetes/manifests/*.yaml /tmp/k8s-manifests-backup/

# Wait for the kubelet to stop the static pods
sleep 20

# Verify they actually stopped
sudo crictl ps | grep -E "kube-apiserver|kube-controller|kube-scheduler|etcd" \
  || echo "all control plane pods stopped"

# Move manifests back so the kubelet restarts them with the new certs
sudo mv /tmp/k8s-manifests-backup/*.yaml /etc/kubernetes/manifests/

# Wait for the API server to come back. Your ~/.kube/config still holds
# the old expired cert at this point, so use the freshly renewed admin.conf
until sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf get --raw=/healthz \
  2>/dev/null | grep -q ok; do sleep 2; done
echo "API server back"

If you are running systemd services instead of static pods:

sudo systemctl restart etcd
sudo systemctl restart kube-apiserver
sudo systemctl restart kube-controller-manager
sudo systemctl restart kube-scheduler

Step 4: Update your kubeconfig

The admin kubeconfig you copied to ~/.kube/config still has the old expired certificate embedded. Update it:

# On the control plane node
sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

If you distributed admin.conf to other machines (CI/CD pipelines, developer workstations, monitoring tools), you need to update those too. Every copy of the old kubeconfig is now useless.
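The same embedded-cert check from step 1 works on any copy, so you can confirm a machine actually received the new file (the path is whatever that machine uses):

grep client-certificate-data ~/.kube/config \
  | awk '{print $2}' | base64 -d | openssl x509 -noout -enddate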

Step 5: Verify

# This should work now
kubectl get nodes

# Verify the new expiry
sudo kubeadm certs check-expiration

# Test that logs and exec work (apiserver-kubelet-client cert)
APISERVER_POD=$(kubectl get pods -n kube-system -l component=kube-apiserver \
  -o name | head -1)
kubectl logs -n kube-system $APISERVER_POD --tail=5

# Test that metrics work (front-proxy-client cert)
kubectl top nodes

Step 6: HA control plane

If you run multiple control plane nodes, you need to renew certificates on every control plane node. The certificates are local to each node and are not synchronized.

# Repeat on each control plane node
ssh control-plane-2 'sudo kubeadm certs renew all'
ssh control-plane-2 'sudo mkdir -p /tmp/k8s-manifests-backup \
  && sudo mv /etc/kubernetes/manifests/*.yaml /tmp/k8s-manifests-backup/ \
  && sleep 20 \
  && sudo mv /tmp/k8s-manifests-backup/*.yaml /etc/kubernetes/manifests/'

A subtle HA gotcha: when a cluster is built incrementally (control planes added over time), each node's certificates were generated on a different day, so they expire on different days. Instead of one clean outage you get a slow drip: one node's etcd peer cert expires, that member drops, you still have quorum so nothing alerts, then the second node's cert expires and the cluster goes into split-brain or loses quorum entirely. Fix every node at once after a renewal, not one at a time.
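A quick way to spot that drift is to compare expiry dates across nodes (the hostnames are examples):

# Compare cert expiry across all control plane nodes
for node in control-plane-1 control-plane-2 control-plane-3; do
  echo "=== $node ==="
  ssh "$node" 'sudo kubeadm certs check-expiration | grep -E "apiserver |etcd-peer"'
done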

How to make sure this never happens again

Fixing expired certificates at 2 AM is a fire drill you should only experience once. Here are three options, in increasing order of preference.

Option 1: Monitor expiry (the floor)

The most common Prometheus metric people grab for this is wrong. apiserver_client_certificate_expiration_seconds measures the expiry of client certs presented to the apiserver, not the apiserver's own server cert. It will not fire when your apiserver cert is about to expire.

The right approach for self-managed clusters: a node-exporter textfile collector that reads the actual files. Drop this script in /etc/cron.daily/k8s-cert-metrics:

#!/bin/bash
# Emit per-cert expiry as a Prometheus metric for node_exporter to scrape
set -euo pipefail

OUT=/var/lib/node_exporter/textfile_collector/k8s_certs.prom
TMP=$(mktemp)

echo "# HELP k8s_cert_expiry_seconds Seconds until certificate expires" > $TMP
echo "# TYPE k8s_cert_expiry_seconds gauge" >> $TMP

for cert in /etc/kubernetes/pki/*.crt /etc/kubernetes/pki/etcd/*.crt; do
  [ -f "$cert" ] || continue
  expiry=$(openssl x509 -in "$cert" -noout -enddate | cut -d= -f2)
  expiry_epoch=$(date -d "$expiry" +%s)
  now_epoch=$(date +%s)
  remaining=$((expiry_epoch - now_epoch))
  name=$(basename "$cert" .crt)
  echo "k8s_cert_expiry_seconds{cert=\"$name\",path=\"$cert\"} $remaining" >> $TMP
done

mv $TMP $OUT

Then alert on it:

- alert: KubernetesCertificateExpiringSoon
  expr: k8s_cert_expiry_seconds < 30 * 24 * 3600
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.cert }} expires in less than 30 days"

- alert: KubernetesCertificateExpiringCritical
  expr: k8s_cert_expiry_seconds < 7 * 24 * 3600
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.cert }} expires in less than 7 days"

Now you get a Slack ping a month before the cert dies, instead of a page at 2 AM.

Option 2: Automate renewal

A monthly cron is fine for clusters where a 30-second control plane restart in the middle of the night is acceptable. The script needs to be more careful than a one-liner because the cron runs unattended at 3 AM.

#!/bin/bash
# /usr/local/sbin/k8s-cert-renew.sh
set -euo pipefail

LOG=/var/log/k8s-cert-renew.log
exec > >(tee -a $LOG) 2>&1
echo "=== $(date -u): starting cert renewal ==="

kubeadm certs renew all

# Restart static pods by moving manifests out and back
BACKUP=/tmp/k8s-manifests-backup-$(date +%s)
mkdir -p $BACKUP
mv /etc/kubernetes/manifests/*.yaml $BACKUP/

sleep 20

mv $BACKUP/*.yaml /etc/kubernetes/manifests/
rmdir $BACKUP

# Wait for API server to come back
for i in $(seq 1 60); do
  if kubectl --kubeconfig=/etc/kubernetes/admin.conf get --raw=/healthz \
       2>/dev/null | grep -q ok; then
    echo "API server healthy"
    break
  fi
  sleep 2
done

# Update root's kubeconfig
mkdir -p /root/.kube
cp /etc/kubernetes/admin.conf /root/.kube/config

echo "=== $(date -u): renewal complete ==="
# /etc/cron.d/k8s-cert-renewal
0 3 1 * * root /usr/local/sbin/k8s-cert-renew.sh

For HA, schedule different control plane nodes on different days of the month so you do not restart all of them at once.
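One way to stagger it, assuming the same script is installed on every node:

# control-plane-1: /etc/cron.d/k8s-cert-renewal
0 3 1 * * root /usr/local/sbin/k8s-cert-renew.sh
# control-plane-2: same job, a day later
0 3 2 * * root /usr/local/sbin/k8s-cert-renew.sh
# control-plane-3
0 3 3 * * root /usr/local/sbin/k8s-cert-renew.sh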

Option 3: Upgrade regularly (the real answer)

Here is the thing most people miss: kubeadm upgrade automatically renews all certificates as part of the upgrade. If you upgrade your cluster at least once a year (which you should be doing anyway, since each Kubernetes release is supported for about 14 months), you will never hit certificate expiry.

This is the strongest argument for keeping your cluster on a recent version. Teams that skip upgrades are the ones who get the 2 AM call.
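The renewal happens by default during an upgrade; the flag below just makes it explicit (the version is a placeholder for your target release):

# kubeadm upgrade renews every kubeadm-managed cert unless you opt out
sudo kubeadm upgrade apply v1.31.0 --certificate-renewal=true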

If you absolutely cannot upgrade often, recent kubeadm releases (1.31+) let you issue certificates with longer validity at install time via the certificateValidityPeriod and caCertificateValidityPeriod fields in the ClusterConfiguration, but that is a workaround, not a fix.
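A minimal sketch of that config, assuming kubeadm 1.31+ and the v1beta4 config API; the durations are examples:

# kubeadm-config.yaml, passed to kubeadm init --config
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
certificateValidityPeriod: 17520h    # leaf certs: 2 years instead of the 1-year default
caCertificateValidityPeriod: 87600h  # CA: 10 years (the default)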

The common mistakes

Mistake 1: Renewing but not restarting. The certificates are renewed on disk, but the running processes still hold the old certificates in memory. You must restart the control plane components.

Mistake 2: Only renewing on one control plane node. In HA setups, each node has its own certificates. Renew on all of them.

Mistake 3: Forgetting the kubeconfig files. admin.conf, controller-manager.conf, and scheduler.conf contain embedded certificates that also expire. kubeadm certs renew all regenerates these on the local node, but you still need to redistribute the new admin.conf to anywhere it was copied (CI runners, developer laptops, monitoring agents).

Mistake 4: Not checking the CA certificate. The CA is valid for 10 years, but if your cluster is old enough, or if the CA was created with a shorter validity, the CA itself can expire. You cannot renew certificates signed by an expired CA, and CA renewal is a much harder procedure (see kubeadm certs renew --help and the manual CA rotation guide in the Kubernetes docs). Catch CA expiry months in advance with the monitoring above.

Mistake 5: Forgetting the kubelet serving cert. Auto-rotation covers the kubelet client cert, not the serving cert. Either turn on serverTLSBootstrap: true and run a CSR approver, or accept that you need to rotate kubelet serving certs manually too.

Mistake 6: No monitoring. If you do not have an alert for certificate expiry, you will get that 2 AM call eventually. It is not a question of if, it is when.

Quick reference: the 2 AM checklist

When you get the call, follow this in order:

1. SSH into a control plane node
2. sudo kubeadm certs check-expiration             # confirm what expired
3. sudo openssl x509 -in /etc/kubernetes/pki/ca.crt -noout -dates  # confirm CA still valid
4. sudo kubeadm certs renew all                    # renew everything
5. Restart control plane:
     sudo mkdir -p /tmp/k8s-manifests-backup
     sudo mv /etc/kubernetes/manifests/*.yaml /tmp/k8s-manifests-backup/
     sleep 20
     sudo mv /tmp/k8s-manifests-backup/*.yaml /etc/kubernetes/manifests/
6. Wait for API server:
until sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf get --raw=/healthz 2>/dev/null | grep -q ok; do sleep 2; done
7. sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
8. kubectl get nodes                               # verify it works
9. Repeat steps 2-7 on every other control plane node (HA)
10. Update kubeconfigs on CI/CD, monitoring, dev machines
11. Set up monitoring so this never happens again

Total time if you know what you are doing: 5 minutes per control plane node.

Total time if you do not: 2 hours of Googling, panicking, and accidentally making it worse.


This is exactly the kind of scenario covered in depth in the Kubernetes Security course, where certificates are one of 40 topics, from API server hardening to runtime threat detection. And in the Kubernetes Cluster Upgrades course, we cover how regular upgrades prevent certificate expiry entirely.
