Debugging Certificate Errors in Production
It is 2 AM. PagerDuty is screaming. Your monitoring dashboard shows HTTPS failures spiking across three services. Customers are seeing "Your connection is not private" errors. The on-call Slack channel is filling up with "Is the site down?"
You SSH into the load balancer. The logs say:
SSL certificate problem: unable to verify the first certificate
You do not have time to read documentation. You need the exact error, the exact cause, and the exact fix. That is what this lesson is: a field guide to every certificate error you will encounter in production, with the OpenSSL commands to diagnose and fix each one.
Part 1: The OpenSSL Debugging Toolkit
Before we get to the errors, you need five OpenSSL commands burned into muscle memory. These five commands will diagnose 95% of certificate problems.
# 1. Test the TLS connection (the first thing you run)
openssl s_client -connect host:443 -servername host
# 2. Read a certificate file (decode what is inside)
openssl x509 -text -noout -in cert.pem
# 3. Verify the certificate chain (is the chain valid?)
openssl verify -CAfile ca-bundle.crt cert.pem
# 4. Check certificate expiry (the most common problem)
openssl x509 -noout -dates -in cert.pem
# 5. See the full chain from a remote server
openssl s_client -showcerts -connect host:443 -servername host
Memorize these five commands. During an outage, you do not have time to search Stack Overflow. Run command 1 first — it shows you the protocol version, cipher, certificate chain, and verification result in one shot. That single output will point you to the problem 80% of the time.
Let us also add a one-liner that combines the most common checks into a single command:
# The "tell me everything about this server's TLS" one-liner
echo | openssl s_client -connect host:443 -servername host 2>/dev/null \
| openssl x509 -noout -subject -issuer -dates -ext subjectAltName
# Output:
# subject=CN = devopsbeast.com
# issuer=C = US, O = Let's Encrypt, CN = R3
# notBefore=Mar 1 00:00:00 2026 GMT
# notAfter=May 30 23:59:59 2026 GMT
# X509v3 Subject Alternative Name:
# DNS:devopsbeast.com, DNS:*.devopsbeast.com
Add this as a shell alias: alias tlscheck='f(){ echo | openssl s_client -connect "$1" -servername "${1%%:*}" 2>/dev/null | openssl x509 -noout -subject -issuer -dates -ext subjectAltName; }; f'. Then you can just run tlscheck devopsbeast.com:443 from anywhere.
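If you want to feed that one-liner into a script or a dashboard, its output is easy to parse. The following is a minimal sketch, assuming the output format shown above; the function name parse_x509_summary and the dictionary layout are illustrative choices, not part of OpenSSL.

```python
import re

# Fields printed by: openssl x509 -noout -subject -issuer -dates
FIELDS = ("subject", "issuer", "notBefore", "notAfter")

def parse_x509_summary(text: str) -> dict:
    """Turn the one-liner's output into a dict (hypothetical helper)."""
    summary = {"san": []}
    for raw in text.splitlines():
        line = raw.strip()
        key, sep, value = line.partition("=")
        if sep and key in FIELDS:
            summary[key] = value.strip()
        elif "DNS:" in line:
            # SAN entries arrive on one line: "DNS:a.example, DNS:b.example"
            summary["san"].extend(re.findall(r"DNS:([^,\s]+)", line))
    return summary

sample = """subject=CN = devopsbeast.com
issuer=C = US, O = Let's Encrypt, CN = R3
notBefore=Mar  1 00:00:00 2026 GMT
notAfter=May 30 23:59:59 2026 GMT
X509v3 Subject Alternative Name:
    DNS:devopsbeast.com, DNS:*.devopsbeast.com
"""
info = parse_x509_summary(sample)
print(info["notAfter"], info["san"])
```

In practice you would capture the text with subprocess.run and pass its stdout to the parser.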
Part 2: The 10 Certificate Errors
Error 1: "certificate has expired"
Full error: verify error:num=10:certificate has expired
What happened: The certificate's Not After date has passed. The certificate was valid but is no longer.
Diagnostic:
# Check expiry from the remote server
echo | openssl s_client -connect host:443 -servername host 2>/dev/null \
| openssl x509 -noout -dates
# notAfter=Jan 15 23:59:59 2026 GMT ← this date has passed
# Check expiry of a local cert file
openssl x509 -noout -enddate -in /etc/ssl/certs/server.crt
Fix:
- Renew the certificate (Let's Encrypt: certbot renew; manual: request a new cert from the CA)
- Deploy the new certificate to the server
- Reload the web server (nginx -s reload, systemctl reload apache2)
# Force renewal with certbot
certbot renew --force-renewal
# Verify the new cert is loaded
echo | openssl s_client -connect host:443 -servername host 2>/dev/null \
| openssl x509 -noout -dates
# notAfter should now be in the future
A company ran cert-manager in Kubernetes for automated renewal. It worked perfectly for 18 months. Then the cluster was migrated to a new namespace, but the cert-manager ClusterIssuer still referenced the old namespace for the DNS solver ServiceAccount. Renewals silently failed for 60 days. The cert expired on a Saturday night. Lesson: monitor certificate expiry with alerts at 30, 14, and 7 days. Do not assume automation is working just because it worked before.
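The 30, 14, and 7 day alert thresholds can be sketched as a small check that takes the notAfter string printed by openssl x509 -noout -dates. This is an illustration only: the function names and the return convention (0 for expired, None for healthy) are assumptions, and a real setup would more likely live in your monitoring system.

```python
import ssl
from datetime import datetime, timezone

ALERT_DAYS = (30, 14, 7)  # thresholds suggested in the lesson

def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between `now` and an OpenSSL-style notAfter string."""
    expiry = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expiry - now).days

def alert_threshold(not_after: str, now: datetime):
    """Return the tightest threshold crossed, 0 if expired, None if healthy."""
    remaining = days_until_expiry(not_after, now)
    if remaining < 0:
        return 0
    crossed = [t for t in ALERT_DAYS if remaining <= t]
    return min(crossed) if crossed else None

now = datetime.now(timezone.utc)
print(alert_threshold("May 30 23:59:59 2026 GMT", now))
```

Passing the current time in as a parameter keeps the check easy to test against fixed dates.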
Error 2: "unable to verify the first certificate"
Full error: verify error:num=21:unable to verify the first certificate
What happened: The server sent the leaf certificate but NOT the intermediate certificate. The client cannot verify who signed the leaf.
Diagnostic:
# Check what certs the server sends
openssl s_client -showcerts -connect host:443 -servername host 2>/dev/null
# If you see only ONE "BEGIN CERTIFICATE" block — the intermediate is missing
# You should see TWO: the leaf and the intermediate
Fix: Configure the server to send the full chain (leaf + intermediate):
# Create the full chain file
cat leaf.crt intermediate.crt > fullchain.crt
# nginx: use the fullchain
# ssl_certificate /etc/ssl/fullchain.crt;
# Kubernetes: recreate the Secret with the full chain
kubectl create secret tls my-tls --cert=fullchain.crt --key=private.key --dry-run=client -o yaml \
| kubectl apply -f -
This is the most common certificate error in production. It is especially insidious because desktop browsers often work fine (they fetch the missing intermediate via AIA), so the engineer testing the deployment sees no error. The failure only shows up in API clients, mobile apps, monitoring tools, and CI pipelines — often hours or days after deployment.
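Because browsers hide this failure, it helps to check the chain length mechanically after every deployment. A minimal sketch, assuming you have captured the output of openssl s_client -showcerts as text; the helper names are hypothetical, and the "fewer than two certificates" rule is the same heuristic described in the diagnostic above.

```python
PEM_HEADER = "-----BEGIN CERTIFICATE-----"

def count_sent_certificates(showcerts_output: str) -> int:
    """Count the PEM blocks the server sent during the handshake."""
    return showcerts_output.count(PEM_HEADER)

def chain_looks_incomplete(showcerts_output: str) -> bool:
    """Heuristic from the lesson: a lone leaf means the intermediate is missing."""
    return count_sent_certificates(showcerts_output) < 2

# Shortened, made-up output for illustration (certificate bodies elided).
leaf_only = "Certificate chain\n 0 s:CN = host\n" + PEM_HEADER + "\n"
full_chain = leaf_only + " 1 s:CN = Intermediate CA\n" + PEM_HEADER + "\n"
print(chain_looks_incomplete(leaf_only), chain_looks_incomplete(full_chain))
```

A check like this fits naturally into a CI smoke test that runs against the deployed endpoint.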
Error 3: "certificate signed by unknown authority"
Full error: x509: certificate signed by unknown authority
What happened: The certificate chain leads to a root CA that is not in the client's trust store. This happens with:
- Self-signed certificates
- Private/internal CAs
- Certificates from CAs not trusted by the client OS
Diagnostic:
# Check the issuer chain
echo | openssl s_client -connect host:443 -servername host 2>/dev/null \
| grep -E "issuer|Verify return"
# Verify return code: 19 (self-signed certificate in certificate chain)
# OR
# Verify return code: 20 (unable to get local issuer certificate)
Fix: Either use a publicly trusted CA (Let's Encrypt) or add the CA to the client's trust store:
# On Ubuntu/Debian: add a custom CA
cp custom-ca.crt /usr/local/share/ca-certificates/
update-ca-certificates
# On RHEL/CentOS: add a custom CA
cp custom-ca.crt /etc/pki/ca-trust/source/anchors/
update-ca-trust
# In Python requests
import requests
response = requests.get("https://internal-service.corp", verify="/path/to/ca-bundle.crt")
# In curl
curl --cacert /path/to/ca-bundle.crt https://internal-service.corp
Error 4: "certificate is valid for X, not Y"
Full error: x509: certificate is valid for app.example.com, not api.example.com
What happened: The domain in the URL does not match any domain in the certificate's Subject Alternative Name (SAN) field.
Diagnostic:
# Check what domains the cert covers
echo | openssl s_client -connect host:443 -servername host 2>/dev/null \
| openssl x509 -noout -ext subjectAltName
# X509v3 Subject Alternative Name:
# DNS:app.example.com, DNS:www.example.com
# Missing: api.example.com
Fix: Get a new certificate that includes the correct domain, or use a wildcard:
# Wildcard cert covers one level of subdomains (api.example.com, but not a.b.example.com)
certbot certonly --dns-cloudflare -d "*.example.com" -d "example.com"
When you see domain mismatch errors, also check if the client is connecting to the correct IP. A common cause is DNS returning a different server (e.g., a CDN edge server, a load balancer with a default cert, or a different service on the same IP). Run dig host and verify the IP, then run openssl s_client -connect IP:443 -servername host to check what cert that IP serves.
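To reason about whether a hostname is covered by a certificate's SAN list, the matching rule can be sketched as follows. This is a simplified illustration, assuming DNS-type SAN entries only and a wildcard that stands for exactly one leftmost label; real TLS libraries apply additional rules, and the function names are hypothetical.

```python
def san_matches(hostname: str, san: str) -> bool:
    """Compare one hostname against one DNS SAN entry, label by label."""
    host_labels = hostname.lower().rstrip(".").split(".")
    san_labels = san.lower().rstrip(".").split(".")
    if len(host_labels) != len(san_labels):
        return False
    if san_labels[0] == "*":
        # A wildcard stands in for exactly one leftmost label.
        return host_labels[1:] == san_labels[1:]
    return host_labels == san_labels

def covered_by(hostname: str, sans: list) -> bool:
    """True if any SAN entry matches the hostname."""
    return any(san_matches(hostname, san) for san in sans)

print(covered_by("api.example.com", ["app.example.com", "www.example.com"]))
print(covered_by("api.example.com", ["*.example.com", "example.com"]))
```

Note that a wildcard entry alone does not cover the bare domain, which is why the certbot command above requests both names.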
Error 5: "wrong version number"
Full error: SSL routines:ssl3_get_record:wrong version number
What happened: You are trying to make a TLS connection to a port that is serving plain HTTP (not HTTPS). The client sends a TLS ClientHello, but the server responds with an HTTP response. The TLS parser cannot decode HTTP as a TLS record.
Diagnostic:
# This will show the error
openssl s_client -connect host:80 -servername host
# CONNECTED(00000003)
# 140234:error:1408F10B:SSL routines:ssl3_get_record:wrong version number
# Try connecting to the right port
openssl s_client -connect host:443 -servername host
Fix: Connect to port 443 (HTTPS) instead of port 80 (HTTP). Or check if the server has TLS configured at all:
# Check if the port speaks HTTP or HTTPS
curl -v http://host:8080 2>&1 | head -5
# If you see "HTTP/1.1 200" — it is HTTP, not HTTPS
curl -v https://host:8080 2>&1 | head -5
# If you see "wrong version number" — port 8080 is HTTP, not HTTPS
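The error name makes more sense once you look at the bytes. A TLS record starts with a one-byte content type followed by a two-byte protocol version, so when a server answers with plain HTTP the parser reads the letters of "HTTP" as a record header. The sketch below illustrates this under that assumption; the helper names are hypothetical and the classification is deliberately rough.

```python
def classify_first_bytes(data: bytes) -> str:
    """Guess whether a server's first bytes are plain HTTP or a TLS record."""
    if data[:5] == b"HTTP/":
        return "plain-http"
    # TLS record header: content type (0x15 alert, 0x16 handshake),
    # then version major 0x03, then version minor.
    if len(data) >= 3 and data[0] in (0x15, 0x16) and data[1] == 0x03:
        return "tls"
    return "unknown"

def version_bytes_seen_by_tls(data: bytes) -> tuple:
    """Bytes 1-2 are what a TLS parser interprets as the protocol version."""
    return (data[1], data[2])

http_reply = b"HTTP/1.1 400 Bad Request\r\n"
tls_reply = bytes([0x16, 0x03, 0x03, 0x00, 0x7A])
print(classify_first_bytes(http_reply), version_bytes_seen_by_tls(http_reply))
print(classify_first_bytes(tls_reply), version_bytes_seen_by_tls(tls_reply))
```

For the HTTP reply, the "version" bytes are the ASCII codes for "TT", which is not a valid TLS version.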
Error 6: "certificate verify failed (self-signed certificate)"
Full error: SSL: CERTIFICATE_VERIFY_FAILED - self-signed certificate
What happened: The server is presenting a self-signed certificate (not signed by any CA). The client's trust store does not contain this certificate.
Diagnostic:
echo | openssl s_client -connect host:443 -servername host 2>/dev/null \
| grep "Verify return"
# Verify return code: 18 (self-signed certificate)
# Confirm: issuer matches subject
echo | openssl s_client -connect host:443 -servername host 2>/dev/null \
| openssl x509 -noout -subject -issuer
# subject=CN = myservice.local
# issuer=CN = myservice.local ← same as subject = self-signed
Fix: Replace with a certificate from a trusted CA. For internal services, use a proper internal CA (step-ca, HashiCorp Vault PKI, cert-manager with a private ClusterIssuer).
Error 7: "remote error: tls: bad certificate"
Full error: remote error: tls: bad certificate (Go) or SSL: SSLV3_ALERT_BAD_CERTIFICATE (OpenSSL)
What happened: The server requires mutual TLS (mTLS) — it sent a CertificateRequest during the handshake, but the client did not provide a client certificate, or the client certificate was rejected.
Diagnostic:
# Check if the server requests a client certificate
echo | openssl s_client -connect host:443 -servername host 2>/dev/null | grep "Acceptable client"
# If you see CA names listed — the server requires mTLS
# Connect with a client certificate
openssl s_client -connect host:443 -servername host \
-cert client.crt -key client.key
Fix: Provide a valid client certificate that is signed by a CA the server trusts:
# In curl
curl --cert client.crt --key client.key https://host/api
# In Python
requests.get("https://host/api", cert=("client.crt", "client.key"))
Mutual TLS (mTLS) is increasingly common in Kubernetes environments: service meshes like Istio and Linkerd use mTLS for all pod-to-pod communication. If you see "bad certificate" errors in a service mesh environment, check whether the sidecar proxy has a valid client certificate and whether the mesh CA (istiod in Istio, formerly Citadel; the identity controller in Linkerd) is healthy.
Error 8: "SSL: CERTIFICATE_VERIFY_FAILED" (Python)
Full error: ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate
What happened: Python cannot find the system CA bundle. This is common in:
- Docker containers with minimal base images (alpine, scratch)
- macOS after a Python upgrade (Python does not use the macOS Keychain by default)
- Virtual environments that do not inherit system certificates
Diagnostic:
# Check where Python looks for CA certs
python3 -c "import ssl; print(ssl.get_default_verify_paths())"
# DefaultVerifyPaths(cafile=None, capath='/etc/ssl/certs', ...)
# If cafile is None and capath is empty — Python has no CA bundle
# Check if the CA bundle exists
ls -la /etc/ssl/certs/ca-certificates.crt
Fix:
# On Alpine-based Docker images — install CA certificates
apk add --no-cache ca-certificates
# On Debian/Ubuntu Docker images
apt-get update && apt-get install -y ca-certificates
# In Python — point to a specific CA bundle
import os
os.environ["SSL_CERT_FILE"] = "/etc/ssl/certs/ca-certificates.crt"
os.environ["REQUESTS_CA_BUNDLE"] = "/etc/ssl/certs/ca-certificates.crt"
# Or install certifi (Python's bundled CA certs)
pip install certifi
The "fix" you will see on Stack Overflow is verify=False or PYTHONHTTPSVERIFY=0. NEVER do this in production. It disables ALL certificate verification, meaning any attacker can impersonate any server. Always fix the root cause by installing the correct CA bundle.
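Before reaching for environment variables, it can help to see which CA bundle, if any, is actually present in the container. The sketch below is one way to check, assuming the two common Linux bundle locations listed in the code; the helper names and the search order are illustrative choices.

```python
import os
import ssl

# Common bundle locations (assumed): Debian/Ubuntu/Alpine, then RHEL/CentOS.
CANDIDATE_BUNDLES = (
    "/etc/ssl/certs/ca-certificates.crt",
    "/etc/pki/tls/certs/ca-bundle.crt",
)

def first_usable(paths):
    """Return the first path that exists as a non-empty file, else None."""
    for path in paths:
        if path and os.path.isfile(path) and os.path.getsize(path) > 0:
            return path
    return None

def find_ca_bundle():
    """Check SSL_CERT_FILE, then Python's default cafile, then known paths."""
    defaults = ssl.get_default_verify_paths()
    return first_usable(
        [os.environ.get("SSL_CERT_FILE"), defaults.cafile, *CANDIDATE_BUNDLES]
    )

print(find_ca_bundle() or "no CA bundle found: install ca-certificates")
```

If this prints no bundle inside your image, installing the ca-certificates package is the fix, not disabling verification.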
Error 9: "x509: certificate relies on legacy Common Name field"
Full error: x509: certificate relies on legacy Common Name field, use SANs instead
What happened: Go 1.15+ (and other modern TLS libraries) no longer accept certificates that use only the Common Name (CN) field for the hostname. The certificate must include Subject Alternative Names (SANs).
Diagnostic:
# Check if the cert has SANs
openssl x509 -text -noout -in cert.pem | grep -A1 "Subject Alternative Name"
# If this outputs nothing — no SANs, only CN
# Check the CN
openssl x509 -noout -subject -in cert.pem
# subject=CN = myservice.internal
Fix: Regenerate the certificate with SANs:
# Generate a cert with SANs using OpenSSL
openssl req -x509 -newkey rsa:2048 -keyout key.pem -out cert.pem -days 365 -nodes \
-subj "/CN=myservice.internal" \
-addext "subjectAltName=DNS:myservice.internal,DNS:myservice.default.svc.cluster.local"
This error is especially common with older internal certificates that were generated before SANs became mandatory. If you have legacy internal PKI generating CN-only certs, update the templates to include SANs. Every certificate issued today should have SANs — CN is only a display name, not used for validation.
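When updating certificate templates, the SAN string passed to -addext is easy to get wrong by hand. A minimal sketch of generating it from a list of names, assuming entries are either DNS names or IP addresses; the function name san_extension is hypothetical.

```python
import ipaddress

def san_extension(names: list) -> str:
    """Build the value for: openssl req ... -addext "<result>"."""
    entries = []
    for name in names:
        try:
            ipaddress.ip_address(name)
        except ValueError:
            entries.append(f"DNS:{name}")  # not an IP, treat as a DNS name
        else:
            entries.append(f"IP:{name}")
    return "subjectAltName=" + ",".join(entries)

print(san_extension([
    "myservice.internal",
    "myservice.default.svc.cluster.local",
    "10.0.0.5",
]))
```

IP addresses need the IP: prefix; listing an address as a DNS entry will not satisfy clients that connect by IP.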
Error 10: "SEC_ERROR_UNKNOWN_ISSUER" (Firefox)
Full error: Firefox shows SEC_ERROR_UNKNOWN_ISSUER with a warning page
What happened: Firefox maintains its own trust store, separate from the operating system's. Chrome and Edge honor roots installed in the OS trust store (Keychain on macOS, the Windows certificate store, /etc/ssl on Linux), while Firefox bundles its own set of trusted root CAs via the Mozilla NSS library.
Diagnostic:
If Chrome works but Firefox does not:
- The certificate may use a root CA that is in the OS trust store but not in Mozilla's trust store
- Or the server is missing an intermediate that Chrome fetches via AIA but Firefox does not have cached
# Check if the root is in Mozilla's list
# https://wiki.mozilla.org/CA/Included_Certificates
# Or check with openssl (simulates a non-browser client)
echo | openssl s_client -connect host:443 -servername host 2>&1 | grep "Verify return"
# If openssl also fails — the problem is the server, not Firefox
Fix: Ensure the server sends the complete chain and uses a root CA trusted by all major platforms. If using an internal CA, have Firefox import it from the OS trust store, either via enterprise policy or by setting security.enterprise_roots.enabled to true in about:config:
Firefox enterprise policy (policies.json):
{
  "policies": {
    "Certificates": {
      "ImportEnterpriseRoots": true
    }
  }
}
Part 3: The Certificate Debugging Decision Tree
When you hit a TLS error, work through this flowchart:
[Interactive flowchart: Certificate Debugging Decision Tree]
Work through this tree in order. Do not skip steps. The most common mistakes in certificate debugging are (1) assuming the problem is the certificate when it is actually a network issue and (2) assuming the certificate is wrong when actually the intermediate is just missing. The decision tree keeps you methodical.
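One way to make the first step of that triage mechanical is to map the verify return codes quoted in Part 2 to their likely causes. The sketch below is an illustration built from those codes, not a reproduction of the flowchart; the mapping text and function names are assumptions.

```python
import re

# Verify return codes and causes as described in Part 2 of this lesson.
TRIAGE = {
    0: "chain verified: check hostname (SAN) mismatch or client-side trust next",
    10: "certificate has expired: renew, deploy, reload",
    18: "self-signed certificate: replace with a cert from a trusted CA",
    19: "self-signed CA in chain: add the CA to the client's trust store",
    20: "issuer not found locally: check the client's CA bundle",
    21: "missing intermediate: configure the server to send the full chain",
}

def verify_return_code(s_client_output: str):
    """Extract the first 'Verify return code' from s_client output."""
    match = re.search(r"Verify return code: (\d+)", s_client_output)
    return int(match.group(1)) if match else None

def triage(s_client_output: str) -> str:
    """Suggest a next step based on the verify return code."""
    code = verify_return_code(s_client_output)
    if code is None:
        return "no verify result: handshake did not complete (network or port?)"
    return TRIAGE.get(code, f"code {code}: not covered in this lesson")

print(triage("Verify return code: 21 (unable to verify the first certificate)"))
```

A missing verify result is treated as a connectivity problem first, which matches the advice not to assume the certificate is at fault.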
Part 4: Kubernetes-Specific Certificate Issues
Kubernetes adds its own layer of certificate complexity. Here are the most common issues:
Ingress Certificate Not Loading
The TLS Secret name in the Ingress does not match the actual Secret name, or the Secret is in the wrong namespace.
# Check what Secret the Ingress references
kubectl get ingress my-ingress -o jsonpath='{.spec.tls[*].secretName}'
# my-tls-cert
# Check if that Secret exists in the SAME namespace
kubectl get secret my-tls-cert
# Error: secrets "my-tls-cert" not found ← this is your problem
# The Secret must be in the SAME namespace as the Ingress
kubectl get secret my-tls-cert -n correct-namespace
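The same check can be scripted across many Ingress resources. A minimal sketch, assuming you have loaded the output of kubectl get ingress -o json as a dict and collected the Secret names that exist in the Ingress's namespace; the function name and the sample data are hypothetical.

```python
def missing_tls_secrets(ingress: dict, secrets_in_namespace: set) -> list:
    """List TLS Secret names an Ingress references that do not exist
    in the Ingress's own namespace."""
    tls_entries = ingress.get("spec", {}).get("tls") or []
    wanted = [entry.get("secretName") for entry in tls_entries]
    return [name for name in wanted if name and name not in secrets_in_namespace]

# Hypothetical sample shaped like `kubectl get ingress my-ingress -o json`.
ingress = {
    "metadata": {"name": "my-ingress", "namespace": "web"},
    "spec": {"tls": [{"hosts": ["app.example.com"], "secretName": "my-tls-cert"}]},
}
print(missing_tls_secrets(ingress, {"other-secret"}))
print(missing_tls_secrets(ingress, {"my-tls-cert"}))
```

An empty list means every referenced Secret is present in the namespace you checked.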
cert-manager Not Renewing
cert-manager automates certificate lifecycle but failures can be silent.
# Check Certificate resource status
kubectl get certificate -A
# NAME READY SECRET AGE
# my-cert False my-tls-cert 90d ← False = renewal failed
# Check the Certificate status conditions
kubectl describe certificate my-cert
# Conditions:
# Type: Ready
# Status: False
# Reason: DoesNotExist
# Message: Issuing certificate as Secret does not exist
# Check the Order (the ACME order for renewal)
kubectl get order -A
# Check the Challenge (the domain verification step)
kubectl get challenge -A
# Common failure: DNS solver cannot create TXT records
kubectl describe challenge my-cert-xxxxx
# State: pending
# Reason: waiting for DNS record to propagate
A team had cert-manager running for two years without issues. Then they rotated their DNS provider's API credentials, but forgot to update the cert-manager Secret that held the API token. Renewals failed silently for 60 days because cert-manager retries with backoff. The cert expired, and all HTTPS traffic broke. The cert-manager logs had been warning for weeks, but nobody was watching them. Set up alerts on cert-manager Certificate Ready status: if any Certificate shows Ready=False for more than 1 hour, page someone.
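The "Ready=False for more than 1 hour" rule can be sketched as a filter over the output of kubectl get certificate -A -o json. This is an illustration only, assuming the Ready condition carries a lastTransitionTime timestamp in RFC 3339 form; the function name and sample data are hypothetical, and a Prometheus alert on cert-manager's metrics would be the more usual production choice.

```python
from datetime import datetime, timedelta, timezone

def stale_not_ready(cert_list: dict, now: datetime,
                    max_age: timedelta = timedelta(hours=1)) -> list:
    """Return namespace/name of Certificates stuck at Ready=False too long."""
    stale = []
    for item in cert_list.get("items", []):
        for cond in item.get("status", {}).get("conditions", []):
            if cond.get("type") != "Ready" or cond.get("status") != "False":
                continue
            # "Z" suffix is rewritten so older Python versions can parse it.
            since = datetime.fromisoformat(
                cond["lastTransitionTime"].replace("Z", "+00:00")
            )
            if now - since > max_age:
                meta = item["metadata"]
                stale.append(f'{meta["namespace"]}/{meta["name"]}')
    return stale

# Hypothetical sample data for illustration.
certs = {"items": [{
    "metadata": {"namespace": "web", "name": "my-cert"},
    "status": {"conditions": [{
        "type": "Ready", "status": "False",
        "lastTransitionTime": "2026-03-01T00:00:00Z",
    }]},
}]}
print(stale_not_ready(certs, datetime(2026, 3, 1, 2, 0, tzinfo=timezone.utc)))
```

Anything this returns is a candidate for a page; the Order and Challenge resources then tell you why issuance is stuck.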
Webhook "x509: certificate signed by unknown authority"
Kubernetes admission webhooks (ValidatingWebhookConfiguration, MutatingWebhookConfiguration) use TLS between the API server and the webhook service. If the CA bundle is wrong, you get x509 errors.
# Check the webhook configuration
kubectl get validatingwebhookconfiguration my-webhook -o yaml | grep caBundle
# The caBundle must contain the base64-encoded CA certificate that signed
# the webhook service's TLS cert
# If using cert-manager, the caBundle is injected automatically via annotations:
# cert-manager.io/inject-ca-from: namespace/certificate-name
kubelet Certificate Expired
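Certificates in a kubeadm cluster expire after 1 year by default. kubeadm manages the control-plane certificates (API server, etcd, front-proxy, and the kubeconfig client certificates); the kubelet's client certificate is normally rotated automatically by the kubelet itself, so kubeadm certs check-expiration does not list it.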
# Check all Kubernetes certificate expirations
kubeadm certs check-expiration
# Renew all certificates
kubeadm certs renew all
# Restart kubelet and control plane components after renewal
systemctl restart kubelet
Set up monitoring for Kubernetes certificate expiry. The x509-certificate-exporter Prometheus exporter can scrape all certificate files on the filesystem and all TLS Secrets in the cluster, exposing days-until-expiry as a metric. Alert when any cert is within 30 days of expiry. This catches kubelet certs, etcd certs, webhook certs, and ingress certs — everything.
Part 5: Prevention — Never Get Woken Up for an Expired Cert
The best debugging is the debugging you never have to do. Here is how to prevent certificate issues:
Automated Renewal
- Public services: Use cert-manager with Let's Encrypt ClusterIssuer for automatic issuance and renewal
- Internal services: Use cert-manager with a Vault or step-ca ClusterIssuer
- Non-Kubernetes: Use certbot with a cron job or systemd timer
# certbot auto-renewal (usually set up by default)
systemctl status certbot.timer
# If inactive, enable it:
systemctl enable --now certbot.timer
Monitoring and Alerting
# Install x509-certificate-exporter for Prometheus
helm install x509-exporter \
enix/x509-certificate-exporter \
--set secretsExporter.enabled=true
# Prometheus alerting rule
# alert: CertificateExpiringSoon
# expr: x509_cert_not_after - time() < 86400 * 30
# labels:
# severity: warning
# annotations:
# summary: "Certificate {{ $labels.subject_CN }} expires in less than 30 days"
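Common Certificate Errors Mapped to Fixes
Error (what you see) | Fix (what you do)
certificate has expired | Renew the certificate, deploy it, reload the web server
unable to verify the first certificate | Configure the server to send the full chain (leaf + intermediate)
certificate signed by unknown authority | Use a publicly trusted CA or add the CA to the client's trust store
certificate is valid for X, not Y | Reissue with the correct SANs (or a wildcard) and confirm DNS points at the right server
wrong version number | Connect to the HTTPS port, or check that TLS is configured on that port
self-signed certificate | Replace with a certificate from a trusted public or internal CA
remote error: tls: bad certificate | Present a client certificate signed by a CA the server trusts
CERTIFICATE_VERIFY_FAILED (Python) | Install ca-certificates or point Python at a CA bundle
certificate relies on legacy Common Name field | Regenerate the certificate with SANs
SEC_ERROR_UNKNOWN_ISSUER (Firefox) | Send the complete chain; import internal roots via enterprise policy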
Part 6: Going Deeper
This module covered the essentials of SSL/TLS — enough to understand encryption, debug certificate errors, and keep production running. But there is much more to the TLS ecosystem:
- TLS protocol internals: record layer, alert protocol, content types, key derivation functions
- Kubernetes PKI: the full certificate architecture of a Kubernetes cluster (API server, kubelet, etcd, front-proxy, service account signing)
- cert-manager deep dive: ClusterIssuers, Certificate resources, ACME solvers, trust-manager, policy-approver
- Mutual TLS (mTLS): client certificates, service mesh identity, SPIFFE/SPIRE
- Certificate Transparency: CT logs, SCTs, how browsers detect misissued certificates
- OCSP and CRL: certificate revocation mechanisms and their failure modes
For the full deep dive into these topics, see the SSL/TLS and Certificate Management course.
Key Concepts Summary
- Five essential OpenSSL commands: s_client, x509 -text, verify, x509 -dates, s_client -showcerts. These diagnose 95% of certificate problems
- The most common error is a missing intermediate certificate: the server sends only the leaf, so non-browser clients cannot verify the chain
- Expired certificates are the second most common — auto-renewal (certbot, cert-manager) fails silently if credentials or DNS are misconfigured
- Domain mismatch means the certificate SAN field does not contain the hostname the client connected to
- Self-signed certificates provide encryption but no trust — never use them in production
- Python CERTIFICATE_VERIFY_FAILED usually means the Docker container is missing the ca-certificates package
- Go requires SANs since version 1.15: certificates with only CN and no SAN are rejected
- Firefox has its own trust store separate from the OS — a cert can work in Chrome but fail in Firefox
- Kubernetes cert issues include mismatched Secret names, cert-manager renewal failures, webhook CA bundle problems, and kubelet cert expiry
- Prevention beats debugging: use cert-manager for auto-renewal, monitor with x509-certificate-exporter, alert at 30 days before expiry
Common Mistakes
- Setting verify=False or PYTHONHTTPSVERIFY=0 to "fix" certificate errors: this disables all security and masks the real problem
- Not using -servername with openssl s_client and getting the wrong certificate back from multi-domain servers
- Testing certificate changes only in Chrome, which masks missing intermediates via AIA fetching
- Assuming auto-renewal is working because it worked last time — always monitor and alert on certificate expiry
- Forgetting that Kubernetes Secrets for TLS must be in the same namespace as the Ingress that references them
- Not checking cert-manager Order and Challenge resources when renewal fails — the Certificate resource just says "not ready" without details
Module 7 Complete
You now have a working understanding of SSL/TLS — from the cryptographic primitives that make HTTPS possible, through the TLS handshake protocol, to the certificate chain of trust, and the practical debugging skills to fix certificate errors at 2 AM.
This is the knowledge that separates engineers who restart nginx and hope for the best from engineers who diagnose the root cause in five minutes with openssl s_client. Keep these lessons bookmarked. You will use them.
Your monitoring shows that an internal API is returning TLS errors. You run openssl s_client and see 'Verify return code: 21 (unable to verify the first certificate)'. What is the most likely cause and fix?