
cert-manager Renewed Your Certificate. Your App Still Serves the Old One. Why?

The certificate in the Secret is fresh. The pod is still serving the expired one. cert-manager did its job. Your app did not. The five renewal failures that bite production.

By Sharon Sahadevan · 11 min read

A monitoring alert fires. https://api.example.com is serving an expired TLS certificate. You pull up the Certificate resource and find this:

kubectl get certificate -n production api-tls -o yaml | grep -E "^status:|renewalTime|notAfter"
status:
  notAfter: "2026-08-04T10:00:00Z"     # 90 days out, freshly renewed
  renewalTime: "2026-07-15T10:00:00Z"

The certificate in the Kubernetes Secret is fresh. cert-manager did its job. The browser still sees the expired one.

What happened? The pod, the ingress controller, or the service mesh that consumes the Secret never reloaded the certificate after cert-manager rotated it. The new bytes are in etcd. The running process is still holding the old bytes in memory.

This is one of the most common cert-manager production bugs, and there are several variants. This post is the catalog: the five real renewal failure modes I have seen in production, what causes each, and how to fix them.

The setup: how cert-manager is supposed to work

cert-manager is a controller. It watches Certificate custom resources, talks to ACME (Let's Encrypt), Vault, your internal CA, or a self-signing issuer, and writes the resulting cert/key into a Kubernetes Secret. When the cert nears expiry, it requests a new one and overwrites the Secret.

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: api-tls
  namespace: production
spec:
  secretName: api-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - api.example.com

cert-manager creates the Secret api-tls containing tls.crt and tls.key. Some other piece of software (an ingress controller, a service mesh, an application) consumes the Secret. That consumer is where the failure lives.

Failure 1: ingress-nginx not reloading on Secret update

ingress-nginx (the most common ingress controller) used to behave this way by default: when the underlying Secret changed, the running NGINX process kept serving the old cert until something restarted the pod or forced a config reload.

Modern ingress-nginx detects Secret changes via informers and triggers a Lua-based reload that swaps certs in place without dropping connections. But this only works if the Ingress references the Secret correctly and the ingress controller's RBAC lets it watch Secrets cluster-wide.

Common cause: the Ingress object references a Secret in a different namespace, and the controller's RBAC does not let it read Secrets there. The original cert was loaded at startup with broader permissions (or because the team copied the Secret manually), but new versions are invisible to the controller.

Fix:

# Verify the controller can see the Secret
kubectl auth can-i get secrets -n production \
  --as=system:serviceaccount:ingress-nginx:ingress-nginx

# Check ingress-nginx logs for cert reload events
kubectl logs -n ingress-nginx deploy/ingress-nginx-controller \
  | grep -E "reloading|certificate"

# Force a reload to verify the cert load works
kubectl rollout restart deploy/ingress-nginx-controller -n ingress-nginx

If restart fixes the issue but it returns next renewal, the controller is loading at startup but not watching for updates. Check RBAC and informer scope.
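
If the gap is RBAC, one way to close it is a namespaced Role and RoleBinding for the controller's ServiceAccount. A minimal sketch, assuming the default ingress-nginx namespace and ServiceAccount names from the check above (adjust to your install):

# Sketch: let the ingress-nginx ServiceAccount read Secrets in the app namespace
kubectl apply -f - <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: read-tls-secrets
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ingress-nginx-read-tls-secrets
  namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: read-tls-secrets
subjects:
  - kind: ServiceAccount
    name: ingress-nginx
    namespace: ingress-nginx
EOF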

Failure 2: application reading the cert at startup, never re-reading

Your app loads /etc/tls/tls.crt and /etc/tls/tls.key from a mounted Secret volume at startup, then never looks at the files again. Kubernetes updates the projected files in place when the Secret changes (for Secret volumes, on the kubelet's next sync interval, typically within about a minute; subPath mounts are the exception and never receive updates), but your app never notices.

This is the most insidious version because everything looks correct from outside the pod:

kubectl exec -it $POD -- ls -la /etc/tls/
# Files are dated today (recent renewal)

kubectl exec -it $POD -- openssl x509 -in /etc/tls/tls.crt -noout -dates
# Cert is fresh

The files are fresh. The process is still using the cached old version it parsed at startup.

Fix options, in increasing order of effort:

Option A: pod restart on Secret change. Add a Reloader controller (stakater/Reloader) that watches Secrets and rolls the Deployment when the Secret is updated:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  annotations:
    secret.reloader.stakater.com/reload: "api-tls"

The Deployment rolls when api-tls changes. Simple, brute force, works.
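
If Reloader is not already running in the cluster, it is typically installed from the stakater Helm chart; a sketch (repo URL and chart name as of writing, verify against the stakater/Reloader docs):

# Install stakater Reloader via Helm (names as of writing; check the project docs)
helm repo add stakater https://stakater.github.io/stakater-charts
helm repo update
helm install reloader stakater/reloader -n reloader --create-namespace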

Option B: app re-reads cert on file change. Use a file watcher (inotify on Linux, fsnotify in Go, watchdog in Python) inside the app to re-read certs when the file mtime changes. This is the cleanest answer for long-lived processes that cannot afford a restart.

// Go example with crypto/tls and a cert reloader
import (
    "crypto/tls"
    "sync"
)

type certReloader struct {
    mu       sync.RWMutex
    cert     *tls.Certificate
    certPath string
    keyPath  string
}

// newCertReloader performs the initial load so the server starts with a valid cert.
func newCertReloader(certPath, keyPath string) (*certReloader, error) {
    r := &certReloader{certPath: certPath, keyPath: keyPath}
    if err := r.reload(); err != nil {
        return nil, err
    }
    return r, nil
}

// GetCertificate hands the current cert to crypto/tls on every handshake.
func (r *certReloader) GetCertificate(_ *tls.ClientHelloInfo) (*tls.Certificate, error) {
    r.mu.RLock()
    defer r.mu.RUnlock()
    return r.cert, nil
}

// reload re-reads the key pair from disk and swaps it in under the write lock.
func (r *certReloader) reload() error {
    cert, err := tls.LoadX509KeyPair(r.certPath, r.keyPath)
    if err != nil {
        return err
    }
    r.mu.Lock()
    r.cert = &cert
    r.mu.Unlock()
    return nil
}

// Use as TLS callback:
tlsConfig.GetCertificate = reloader.GetCertificate
// Set up a file watcher (inotify/fsnotify) to call reloader.reload() on change

This is what good gRPC servers, Envoy, and modern HTTP servers do. Worth the engineering investment for any production-facing service.

Option C: SIGHUP handler. A few servers reload on SIGHUP (NGINX, HAProxy). Pair with a sidecar that watches the Secret and sends kill -HUP to the main process.
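
A minimal sidecar sketch, assuming the pod sets shareProcessNamespace: true, the cert is mounted at /etc/tls, and the main container runs an nginx process that reloads on SIGHUP:

#!/bin/sh
# Sidecar sketch: hash the mounted cert and HUP the server when it changes.
# Assumes shareProcessNamespace: true, the cert at /etc/tls/tls.crt, and a
# main process named "nginx" that reloads certs on SIGHUP.
last=""
while true; do
  current=$(sha256sum /etc/tls/tls.crt | cut -d' ' -f1)
  if [ -n "$last" ] && [ "$current" != "$last" ]; then
    pkill -HUP nginx
  fi
  last="$current"
  sleep 30
done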

Failure 3: ACME rate limit hit

Let's Encrypt's production rate limits include 50 certificates per registered domain per week. A misconfigured cluster that creates and deletes Certificates in a loop (a faulty operator, a CI pipeline that recreates resources on every run, an over-eager Kustomize patch) can exhaust this in hours.

Once you hit the limit, cert-manager's renewal attempts fail and the existing cert eventually expires. Check the cert-manager logs:

kubectl logs -n cert-manager deploy/cert-manager | grep -E "rate.?limit|too many"

You will see something like urn:ietf:params:acme:error:rateLimited.

Fix:

  1. Stop the loop. Find the controller or pipeline creating duplicate Certificates and fix it. kubectl get certificate --all-namespaces --sort-by=.metadata.creationTimestamp lists certs oldest first; the most recently created ones at the bottom are the suspects.
  2. Use the staging issuer for development. Let's Encrypt has a staging environment with much higher rate limits. Set up a letsencrypt-staging ClusterIssuer for non-production work.
  3. Wait it out, or request a rate limit increase. The 50-per-domain-per-week limit is applied over a rolling week, so waiting lets older issuances age out. For legitimate high-volume use, fill out Let's Encrypt's rate limit adjustment form.
# Use staging for dev/staging clusters
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: ops@example.com
    privateKeySecretRef:
      name: letsencrypt-staging-key
    solvers:
      - http01:
          ingress:
            class: nginx

Failure 4: HTTP-01 challenge can't reach the cluster

cert-manager requests a cert from Let's Encrypt; LE responds with a challenge (HTTP-01: serve /.well-known/acme-challenge/TOKEN from your domain); cert-manager creates a Pod and Ingress to serve the challenge; LE fetches the URL and validates.

Common failures:

  • Cluster is private, LE can't reach it. HTTP-01 won't work; you need DNS-01 (uses DNS TXT records, no public HTTP needed).
  • Ingress class mismatch. The cert-manager-created challenge Ingress has the wrong class annotation, or it's filtered out by your ingress controller's selector.
  • Hostname mismatch. The challenge resolves to a different IP than your real ingress (Cloudflare proxying, multi-region setups with regional DNS).
  • WAF blocking the challenge path. Cloudflare, AWS WAF, or other edge protections sometimes block /.well-known/ paths by default.

Diagnose with:

# Check challenge events
kubectl get challenges --all-namespaces
kubectl describe challenge -n production api-tls-XXX

# Try to fetch the challenge URL externally as LE would
curl -v http://api.example.com/.well-known/acme-challenge/test
# Should reach your cluster's challenge pod

For private clusters, switch to DNS-01:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com
    privateKeySecretRef:
      name: letsencrypt-dns-key
    solvers:
      - dns01:
          route53:
            region: us-east-1
            # IRSA-bound IAM role with route53:ChangeResourceRecordSets

DNS-01 also lets you issue wildcard certs (*.example.com), which HTTP-01 cannot.
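
As an illustration, a wildcard Certificate issued through the DNS-01 issuer above would look something like this (names are placeholders):

# Sketch: wildcard cert via the DNS-01 issuer above (placeholder names)
kubectl apply -f - <<'EOF'
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-example-com
  namespace: production
spec:
  secretName: wildcard-example-com-tls
  issuerRef:
    name: letsencrypt-dns
    kind: ClusterIssuer
  dnsNames:
    - "*.example.com"
    - example.com
EOF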

Failure 5: Secret synced to ESO/Vault, downstream apps using stale copy

A team uses External Secrets Operator (or Vault sync) to mirror cert-manager Secrets into other namespaces or external systems. cert-manager updates the source; ESO updates the destination on its refreshInterval; downstream consumers are still on whatever they cached at last reload.

The chain looks fine on inspection (every link has a fresh cert) but only the leaf consumers actually serve traffic, and they are using their cached copy.

This is essentially Failure 2 with extra hops. Each consumer needs its own reload-on-change mechanism.

Diagnostic:

# Compare cert serial numbers at each layer
# 1. cert-manager source
kubectl get secret -n production api-tls -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -serial

# 2. ESO destination
kubectl get secret -n other-namespace api-tls-mirror -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -serial

# 3. What the actual app is serving
echo | openssl s_client -connect api.example.com:443 -servername api.example.com 2>/dev/null \
  | openssl x509 -noout -serial

If the three serials don't match, you have stale data somewhere in the chain. Trace from the end (the app's view) backward to find the layer that hasn't picked up the new cert.

How to set this up so it just works

A reliable cert-manager setup has these properties:

1. Issuer separation. letsencrypt-prod for production, letsencrypt-staging for dev/staging. Test issuance changes against staging first.

2. Monitoring on the renewal pipeline. Alerts on:

  • Certificate.status.notAfter approaching expiry without a renewal having landed (the rules below use a 14-day threshold)
  • Certificate.status.conditions[Ready] = False
  • cert-manager controller error rate
  • Per-domain ACME failures from cert-manager metrics
# Alert: certificate not ready
certmanager_certificate_ready_status{condition="False"} == 1

# Alert: certificate expiring within 14 days but not yet renewed
(certmanager_certificate_expiration_timestamp_seconds - time()) < 14*24*3600

# Alert: ACME order failures
rate(certmanager_http_acme_client_request_count{status!~"^2.."}[1h]) > 0

3. Reload-on-change for every consumer. Either:

  • The consumer watches its cert files (best)
  • Reloader controller restarts the consumer when the Secret updates (acceptable)
  • Pinned restart cadence that aligns with renewal (acceptable for batch workloads)

4. End-to-end serial number monitoring. A synthetic check that hits the public endpoint, parses the serial, compares to the cert in the cluster Secret. Alerts if they diverge for more than 5 minutes.

#!/bin/bash
# /etc/cron.d/cert-serial-check
DOMAIN=api.example.com
NS=production
SECRET=api-tls

PUBLIC_SERIAL=$(echo | openssl s_client -connect $DOMAIN:443 -servername $DOMAIN 2>/dev/null \
  | openssl x509 -noout -serial 2>/dev/null | cut -d= -f2)

CLUSTER_SERIAL=$(kubectl get secret -n $NS $SECRET -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -serial | cut -d= -f2)

if [ "$PUBLIC_SERIAL" != "$CLUSTER_SERIAL" ]; then
  echo "WARN: $DOMAIN serving $PUBLIC_SERIAL but cluster has $CLUSTER_SERIAL"
  # send alert
fi

5. Dry-run on issuer changes. Before changing the production ClusterIssuer (e.g., switching from HTTP-01 to DNS-01), test against the staging issuer with a sample Certificate.
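
A throwaway Certificate pointed at the staging issuer is usually enough to prove the change end to end; a sketch with placeholder names:

# Sketch: validate issuer/solver changes against staging before touching prod
kubectl apply -f - <<'EOF'
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: issuer-dry-run
  namespace: production
spec:
  secretName: issuer-dry-run-tls
  issuerRef:
    name: letsencrypt-staging
    kind: ClusterIssuer
  dnsNames:
    - dry-run.example.com
EOF

# Watch it go Ready, then clean up
kubectl get certificate -n production issuer-dry-run -w
kubectl delete certificate -n production issuer-dry-run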

The mental model

cert-manager's job ends when the Secret is updated. From there, the chain of consumers (ingress, mesh, app) is your responsibility. Most production "cert-manager bugs" are actually consumer bugs: the consumer parsed the cert at startup and never reparsed.

Fix the chain at the leaf: the thing actually serving TLS traffic. If it does not reload on Secret change, your renewal automation is decoration. The cert in the Secret is fresh; the cert on the wire is what users see.

Quick reference: the "TLS expired in production" checklist

1. Compare serial numbers up the chain:
   - Public TLS endpoint (what users see)
   - Cluster Secret (what cert-manager wrote)
   - Any intermediate copies (ESO mirrors, Vault sync)
   First mismatch = first stale layer.

2. Check cert-manager status:
   kubectl get certificate --all-namespaces
   kubectl get challenges --all-namespaces  (if HTTP-01)
   kubectl logs -n cert-manager deploy/cert-manager | tail -50

3. If cert-manager is OK but consumer is stale:
   Force consumer reload:
   - kubectl rollout restart deploy/ingress-nginx-controller
   - kubectl rollout restart deploy/api
   Confirm public serial matches.

4. If cert-manager itself is failing:
   - Rate limit? wait or switch to staging
   - Challenge not reachable? check ingress class, WAF, public DNS
   - Issuer config? compare against a known-good cluster

5. Make it permanent:
   - Reloader controller for consumers that don't watch
   - File-watcher reload in app code for long-lived servers
   - Synthetic serial-number check as the final monitor

What to instrument before the next renewal

If you set up only two alerts, make them certificate-not-ready and serial-number-divergence. The first catches cert-manager-side failures before users see them; the second catches consumer-side failures. Together, every TLS expiry incident becomes a 30-day-warning ticket instead of a 2 AM page.


The full cert-manager and TLS renewal lifecycle, including private CAs, mTLS rotation, and HSM-backed signing, is covered in the SSL/TLS Certificate Management course. The Kubernetes-specific aspects (Ingress TLS, kubelet certs, etcd certs, mesh mTLS) are part of the Kubernetes Security course.
