Networking Fundamentals for Engineers

Debugging Certificate Errors in Production

It is 2 AM. PagerDuty is screaming. Your monitoring dashboard shows HTTPS failures spiking across three services. Customers are seeing "Your connection is not private" errors. The on-call Slack channel is filling up with "Is the site down?"

You SSH into the load balancer. The logs say: SSL certificate problem: unable to verify the first certificate.

You do not have time to read documentation. You need the exact error, the exact cause, and the exact fix. That is what this lesson is: a field guide to the certificate errors you are most likely to encounter in production, with the OpenSSL commands to diagnose and fix each one.

Part 1: The OpenSSL Debugging Toolkit

Before we get to the errors, you need five OpenSSL commands burned into muscle memory. These five commands will diagnose 95% of certificate problems.

# 1. Test the TLS connection (the first thing you run)
openssl s_client -connect host:443 -servername host

# 2. Read a certificate file (decode what is inside)
openssl x509 -text -noout -in cert.pem

# 3. Verify the certificate chain (is the chain valid?)
openssl verify -CAfile ca-bundle.crt cert.pem

# 4. Check certificate expiry (the most common problem)
openssl x509 -noout -dates -in cert.pem

# 5. See the full chain from a remote server
openssl s_client -showcerts -connect host:443 -servername host

KEY CONCEPT

Memorize these five commands. During an outage, you do not have time to search Stack Overflow. Run command 1 first: it shows you the protocol version, cipher, certificate chain, and verification result in one shot. That single output will point you to the problem 80% of the time.

Let us also add a one-liner that combines the most common checks into a single command:

# The "tell me everything about this server's TLS" one-liner
# (the -ext option needs OpenSSL 1.1.1 or newer)
echo | openssl s_client -connect host:443 -servername host 2>/dev/null \
  | openssl x509 -noout -subject -issuer -dates -ext subjectAltName

# Output:
# subject=CN = devopsbeast.com
# issuer=C = US, O = Let's Encrypt, CN = R3
# notBefore=Mar  1 00:00:00 2026 GMT
# notAfter=May 30 23:59:59 2026 GMT
# X509v3 Subject Alternative Name:
#     DNS:devopsbeast.com, DNS:*.devopsbeast.com

PRO TIP

Add this as a shell alias: alias tlscheck='f(){ echo | openssl s_client -connect "$1" -servername "${1%%:*}" 2>/dev/null | openssl x509 -noout -subject -issuer -dates -ext subjectAltName; }; f'. Then you can just run tlscheck devopsbeast.com:443 from anywhere.
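If you would rather run the same check from a script than from a shell alias, here is a minimal Python sketch using only the standard library. The function names summarize_cert and fetch_cert are illustrative, not part of any existing tool. It assumes the server's certificate verifies against the system trust store; if verification fails, fetch_cert raises instead of returning a certificate.

```python
import socket
import ssl


def summarize_cert(cert: dict) -> dict:
    """Pull subject CN, issuer CN, validity dates, and DNS SANs out of a
    dict shaped like the return value of ssl.SSLSocket.getpeercert()."""
    def common_name(rdns):
        # rdns is a tuple of RDNs, each a tuple of (key, value) pairs
        for rdn in rdns:
            for key, value in rdn:
                if key == "commonName":
                    return value
        return None

    return {
        "subject": common_name(cert.get("subject", ())),
        "issuer": common_name(cert.get("issuer", ())),
        "not_before": cert.get("notBefore"),
        "not_after": cert.get("notAfter"),
        "dns_sans": [v for k, v in cert.get("subjectAltName", ()) if k == "DNS"],
    }


def fetch_cert(host: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Connect with normal verification enabled and return the peer cert dict.
    Raises ssl.SSLCertVerificationError if the chain or hostname is invalid."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # server_hostname sends SNI, the equivalent of -servername
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()
```

Typical use would be print(summarize_cert(fetch_cert("devopsbeast.com"))). Because verification stays on, a failure here is itself a diagnostic: the exception message carries the same verify error text that openssl reports.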

Part 2: The 10 Certificate Errors

Error 1: "certificate has expired"

Full error: verify error:num=10:certificate has expired

What happened: The certificate's Not After date has passed. The certificate was valid but is no longer.

Diagnostic:

# Check expiry from the remote server
echo | openssl s_client -connect host:443 -servername host 2>/dev/null \
  | openssl x509 -noout -dates
# notAfter=Jan 15 23:59:59 2026 GMT   ← this date has passed

# Check expiry of a local cert file
openssl x509 -noout -enddate -in /etc/ssl/certs/server.crt

Fix:

Renew the certificate (Let's Encrypt: certbot renew; manual: request new cert from CA)
Deploy the new certificate to the server
Reload the web server (nginx -s reload, systemctl reload apache2)

# Force renewal with certbot
certbot renew --force-renewal

# Verify the new cert is loaded
echo | openssl s_client -connect host:443 -servername host 2>/dev/null \
  | openssl x509 -noout -dates
# notAfter should now be in the future

WAR STORY

A company ran cert-manager in Kubernetes for automated renewal. It worked perfectly for 18 months. Then the cluster was migrated to a new namespace, but the cert-manager ClusterIssuer still referenced the old namespace for the DNS solver ServiceAccount. Renewals silently failed for 60 days. The cert expired on a Saturday night. Lesson: monitor certificate expiry with alerts at 30, 14, and 7 days. Do not assume automation is working just because it worked before.
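The 30/14/7-day alerting rule from the war story can be sketched in a few lines of Python. This is an illustration, not a monitoring product: the names days_until_expiry and alert_threshold are made up for this example, and the input is assumed to be the notAfter string exactly as openssl prints it.

```python
import ssl

ALERT_THRESHOLDS_DAYS = (7, 14, 30)  # tightest first


def days_until_expiry(not_after: str, now: float) -> int:
    """Whole days between `now` (epoch seconds) and a notAfter string in the
    format OpenSSL prints, e.g. 'May 30 23:59:59 2026 GMT'. Negative if expired."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    return int((expires_at - now) // 86400)


def alert_threshold(days_left: int):
    """Return the tightest threshold crossed (7, 14, or 30), 0 if already
    expired, or None if the certificate is not close to expiry."""
    if days_left < 0:
        return 0
    for threshold in ALERT_THRESHOLDS_DAYS:
        if days_left <= threshold:
            return threshold
    return None
```

In practice you would pass time.time() as now and feed the result into whatever pages your on-call. Returning the tightest threshold lets the caller escalate severity as the date approaches.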

Error 2: "unable to verify the first certificate"

Full error: verify error:num=21:unable to verify the first certificate

What happened: The server sent the leaf certificate but NOT the intermediate certificate. The client cannot verify who signed the leaf.

Diagnostic:

# Check what certs the server sends
# (echo closes stdin so s_client exits instead of waiting for input)
echo | openssl s_client -showcerts -connect host:443 -servername host 2>/dev/null

# If you see only ONE "BEGIN CERTIFICATE" block, the intermediate is missing
# You should see at least TWO: the leaf and the intermediate(s)

Fix: Configure the server to send the full chain (leaf + intermediate):

# Create the full chain file
cat leaf.crt intermediate.crt > fullchain.crt

# nginx: use the fullchain
# ssl_certificate /etc/ssl/fullchain.crt;

# Kubernetes: recreate the Secret with the full chain
kubectl create secret tls my-tls --cert=fullchain.crt --key=private.key --dry-run=client -o yaml \
  | kubectl apply -f -

WARNING

This is the most common certificate error in production. It is especially insidious because desktop browsers often work fine (they fetch the missing intermediate via AIA), so the engineer testing the deployment sees no error. The failure only shows up in API clients, mobile apps, monitoring tools, and CI pipelines, often hours or days after deployment.
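Counting the "BEGIN CERTIFICATE" blocks is easy to script, which makes it a useful pre-deploy check in CI. The sketch below is a heuristic only, and the function names are invented for this example. It works on any text containing PEM blocks, such as a fullchain.crt file or captured s_client -showcerts output.

```python
import re

PEM_CERT_RE = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.DOTALL
)


def split_pem_chain(pem_text: str) -> list:
    """Return each PEM certificate block found in the text, in order."""
    return PEM_CERT_RE.findall(pem_text)


def chain_looks_complete(pem_text: str) -> bool:
    """Heuristic only: a leaf with no intermediate behind it is the classic
    'unable to verify the first certificate' setup. A self-signed cert would
    legitimately consist of a single block, so treat False as a prompt to
    look closer, not as proof of a problem."""
    return len(split_pem_chain(pem_text)) >= 2
```

This does not validate signatures or ordering; openssl verify remains the authoritative check. It only catches the specific mistake of deploying the leaf on its own.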

Error 3: "certificate signed by unknown authority"

Full error: x509: certificate signed by unknown authority

What happened: The certificate chain leads to a root CA that is not in the client's trust store. This happens with:

Self-signed certificates
Private/internal CAs
Certificates from CAs not trusted by the client OS

Diagnostic:

# Check the issuer chain
echo | openssl s_client -connect host:443 -servername host 2>/dev/null \
  | grep -E "issuer|Verify return"

# Verify return code: 19 (self-signed certificate in certificate chain)
# OR
# Verify return code: 20 (unable to get local issuer certificate)

Fix: Either use a publicly trusted CA (Let's Encrypt) or add the CA to the client's trust store:

# On Ubuntu/Debian: add a custom CA
cp custom-ca.crt /usr/local/share/ca-certificates/
update-ca-certificates

# On RHEL/CentOS: add a custom CA
cp custom-ca.crt /etc/pki/ca-trust/source/anchors/
update-ca-trust

# In Python requests
import requests
response = requests.get("https://internal-service.corp", verify="/path/to/ca-bundle.crt")

# In curl
curl --cacert /path/to/ca-bundle.crt https://internal-service.corp

Error 4: "certificate is valid for X, not Y"

Full error: x509: certificate is valid for app.example.com, not api.example.com

What happened: The domain in the URL does not match any domain in the certificate's Subject Alternative Name (SAN) field.

Diagnostic:

# Check what domains the cert covers
echo | openssl s_client -connect host:443 -servername host 2>/dev/null \
  | openssl x509 -noout -ext subjectAltName
# X509v3 Subject Alternative Name:
#     DNS:app.example.com, DNS:www.example.com
# Missing: api.example.com

Fix: Get a new certificate that includes the correct domain, or use a wildcard:

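# A wildcard cert covers one level of subdomains
# (api.example.com matches *.example.com; a.b.example.com and example.com do not)
# Requires the certbot-dns-cloudflare plugin and a Cloudflare API credentials file
certbot certonly --dns-cloudflare -d "*.example.com" -d "example.com"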

PRO TIP

When you see domain mismatch errors, also check if the client is connecting to the correct IP. A common cause is DNS returning a different server (e.g., a CDN edge server, a load balancer with a default cert, or a different service on the same IP). Run dig host and verify the IP, then run openssl s_client -connect IP:443 -servername host to check what cert that IP serves.
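To make the matching rule concrete, here is a simplified Python sketch of how a hostname is compared against the DNS entries in a SAN list. It is an illustration of the rule, not a replacement for your TLS library's verification: it handles only exact matches and a leading "*." wildcard, and the function name is made up for this example.

```python
def hostname_matches(hostname: str, san_dns_names) -> bool:
    """Simplified SAN matching: exact match, or a '*.' wildcard that stands
    in for exactly one leftmost label. Case-insensitive."""
    host = hostname.lower().rstrip(".")
    for pattern in san_dns_names:
        pattern = pattern.lower().rstrip(".")
        if pattern == host:
            return True
        if pattern.startswith("*."):
            suffix = pattern[1:]  # e.g. ".example.com"
            first_label, dot, rest = host.partition(".")
            # The wildcard replaces one label only, so everything after the
            # first dot in the hostname must equal the pattern's suffix.
            if dot and first_label and "." + rest == suffix:
                return True
    return False
```

The cases worth remembering: a wildcard for *.example.com matches api.example.com, but it does not match a.b.example.com or the bare example.com, which is why the apex domain is usually listed as a separate SAN.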

Error 5: "wrong version number"

Full error: SSL routines:ssl3_get_record:wrong version number

What happened: You are trying to make a TLS connection to a port that is serving plain HTTP (not HTTPS). The client sends a TLS ClientHello, but the server responds with an HTTP response. The TLS parser cannot decode HTTP as a TLS record.

Diagnostic:

# This will show the error
openssl s_client -connect host:80 -servername host
# CONNECTED(00000003)
# 140234:error:1408F10B:SSL routines:ssl3_get_record:wrong version number

# Try connecting to the right port
openssl s_client -connect host:443 -servername host

Fix: Connect to port 443 (HTTPS) instead of port 80 (HTTP). Or check if the server has TLS configured at all:

# Check if the port speaks HTTP or HTTPS
curl -v http://host:8080 2>&1 | head -5
# If you see "HTTP/1.1 200" — it is HTTP, not HTTPS

curl -v https://host:8080 2>&1 | head -5
# If you see "wrong version number" — port 8080 is HTTP, not HTTPS
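The reason the error says "wrong version number" becomes clear once you look at the bytes. A TLS record starts with a one-byte content type followed by a two-byte version; when the server replies with plain HTTP, the parser reads the letters "TT" from "HTTP" as the version field. The following Python sketch illustrates that; both function names are hypothetical helpers for this example.

```python
def classify_first_bytes(data: bytes) -> str:
    """Guess what a server is speaking from the first bytes it sends back.

    A TLS record header is: content type (1 byte), protocol version
    (2 bytes, major byte is 0x03), length (2 bytes). Plaintext HTTP replies
    start with the ASCII text 'HTTP/'.
    """
    if data.startswith(b"HTTP/"):
        return "plaintext-http"
    # 0x14-0x17 are the TLS record content types
    # (change_cipher_spec, alert, handshake, application_data)
    if len(data) >= 3 and data[0] in (0x14, 0x15, 0x16, 0x17) and data[1] == 0x03:
        return "tls"
    return "unknown"


def version_field_as_seen_by_tls(data: bytes) -> str:
    """Show, as hex, the two bytes a TLS parser would read as the record version."""
    return data[1:3].hex()
```

For a reply beginning with "HTTP/1.1", the version field comes out as 5454, which is not any TLS version, hence the error.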

Error 6: "certificate verify failed (self-signed certificate)"

Full error: SSL: CERTIFICATE_VERIFY_FAILED - self-signed certificate

What happened: The server is presenting a self-signed certificate (not signed by any CA). The client's trust store does not contain this certificate.

Diagnostic:

echo | openssl s_client -connect host:443 -servername host 2>/dev/null \
  | grep "Verify return"
# Verify return code: 18 (self-signed certificate)

# Confirm: issuer matches subject
echo | openssl s_client -connect host:443 -servername host 2>/dev/null \
  | openssl x509 -noout -subject -issuer
# subject=CN = myservice.local
# issuer=CN = myservice.local    ← same as subject = self-signed

Fix: Replace with a certificate from a trusted CA. For internal services, use a proper internal CA (step-ca, HashiCorp Vault PKI, cert-manager with a private ClusterIssuer).

Error 7: "remote error: tls: bad certificate"

Full error: remote error: tls: bad certificate (Go) or SSL: SSLV3_ALERT_BAD_CERTIFICATE (OpenSSL)

What happened: The server requires mutual TLS (mTLS): it sent a CertificateRequest during the handshake, but the client did not provide a client certificate, or the client certificate was rejected.

Diagnostic:

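# Check if the server requests a client certificate
# (echo closes stdin so s_client exits instead of hanging the pipeline)
echo | openssl s_client -connect host:443 -servername host 2>/dev/null | grep -A3 "Acceptable client"
# If CA names are listed, the server is requesting a client certificate (mTLS)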

# Connect with a client certificate
openssl s_client -connect host:443 -servername host \
  -cert client.crt -key client.key

Fix: Provide a valid client certificate that is signed by a CA the server trusts:

# In curl
curl --cert client.crt --key client.key https://host/api

# In Python
requests.get("https://host/api", cert=("client.crt", "client.key"))

KEY CONCEPT

Mutual TLS (mTLS) is increasingly common in Kubernetes environments: service meshes like Istio and Linkerd use mTLS for pod-to-pod communication between meshed workloads. If you see "bad certificate" errors in a service mesh environment, check whether the sidecar proxy has a valid client certificate and whether the mesh CA (istiod in Istio, formerly Citadel; the identity service in Linkerd) is healthy.

Error 8: "SSL: CERTIFICATE_VERIFY_FAILED" (Python)

Full error: ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate

What happened: Python cannot find the system CA bundle. This is common in:

Docker containers with minimal base images (alpine, scratch)
macOS after a Python upgrade (Python does not use the macOS Keychain by default)
Virtual environments that do not inherit system certificates

Diagnostic:

# Check where Python looks for CA certs
python3 -c "import ssl; print(ssl.get_default_verify_paths())"
# DefaultVerifyPaths(cafile=None, capath='/etc/ssl/certs', ...)

# If cafile is None and capath is empty — Python has no CA bundle

# Check if the CA bundle exists
ls -la /etc/ssl/certs/ca-certificates.crt

Fix:

# On Alpine-based Docker images — install CA certificates
apk add --no-cache ca-certificates

# On Debian/Ubuntu Docker images
apt-get update && apt-get install -y ca-certificates

# In Python — point to a specific CA bundle
import os
os.environ["SSL_CERT_FILE"] = "/etc/ssl/certs/ca-certificates.crt"
os.environ["REQUESTS_CA_BUNDLE"] = "/etc/ssl/certs/ca-certificates.crt"

# Or install certifi (Mozilla's CA bundle packaged for Python)
pip install certifi

WARNING

The "fix" you will see on Stack Overflow is verify=False or PYTHONHTTPSVERIFY=0. NEVER do this in production. It disables ALL certificate verification, meaning any attacker can impersonate any server. Always fix the root cause by installing the correct CA bundle.
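Instead of turning verification off, a script can look for a CA bundle in the usual places and fail loudly if none exists. The sketch below assumes the common distro paths listed in the tuple; your image may use a different location. The function names are illustrative.

```python
import os
import ssl

# Common bundle locations; which one exists depends on the distro.
COMMON_CA_BUNDLES = (
    "/etc/ssl/certs/ca-certificates.crt",  # Debian, Ubuntu, Alpine
    "/etc/pki/tls/certs/ca-bundle.crt",    # RHEL, CentOS, Fedora
    "/etc/ssl/cert.pem",                   # macOS, some Alpine images
)


def find_ca_bundle(candidates=COMMON_CA_BUNDLES):
    """Return the first candidate path that exists and is non-empty, else None."""
    for path in candidates:
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            return path
    return None


def build_verified_context(candidates=COMMON_CA_BUNDLES) -> ssl.SSLContext:
    """Create a verifying context, preferring an explicit bundle when found.
    Verification is never disabled; with no bundle, the library defaults apply."""
    bundle = find_ca_bundle(candidates)
    if bundle is None:
        return ssl.create_default_context()
    return ssl.create_default_context(cafile=bundle)
```

If find_ca_bundle() returns None inside a container, that is a strong hint the ca-certificates package is missing from the image, which is the root cause to fix.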

Error 9: "x509: certificate relies on legacy Common Name field"

Full error: x509: certificate relies on legacy Common Name field, use SANs instead

What happened: Go 1.15+ (and other modern TLS libraries) no longer accept certificates that use only the Common Name (CN) field for the hostname. The certificate must include Subject Alternative Names (SANs).

Diagnostic:

# Check if the cert has SANs
openssl x509 -text -noout -in cert.pem | grep -A1 "Subject Alternative Name"
# If this outputs nothing — no SANs, only CN

# Check the CN
openssl x509 -noout -subject -in cert.pem
# subject=CN = myservice.internal

Fix: Regenerate the certificate with SANs:

# Generate a cert with SANs using OpenSSL
openssl req -x509 -newkey rsa:2048 -keyout key.pem -out cert.pem -days 365 -nodes \
  -subj "/CN=myservice.internal" \
  -addext "subjectAltName=DNS:myservice.internal,DNS:myservice.default.svc.cluster.local"

PRO TIP

This error is especially common with older internal certificates that were generated before SANs became mandatory. If you have legacy internal PKI generating CN-only certs, update the templates to include SANs. Every certificate issued today should have SANs. CN is only a display name, not used for validation.
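If you need to audit many certificates for this problem, the check is simple to express in code. This sketch assumes each certificate has already been decoded into a dict shaped like Python's ssl.SSLSocket.getpeercert() output; the function name is made up for this example.

```python
def relies_on_common_name(cert: dict) -> bool:
    """True if the cert has a subject CN but no DNS or IP SAN entries.

    `cert` is assumed to look like ssl.SSLSocket.getpeercert() output, where
    the 'subjectAltName' key is absent when the extension is missing.
    """
    sans = cert.get("subjectAltName", ())
    has_host_san = any(kind in ("DNS", "IP Address") for kind, _ in sans)
    has_cn = any(
        key == "commonName"
        for rdn in cert.get("subject", ())
        for key, _ in rdn
    )
    return has_cn and not has_host_san
```

Any certificate for which this returns True is a candidate for reissuing with a SAN extension before a Go-based client runs into it.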

Error 10: "SEC_ERROR_UNKNOWN_ISSUER" (Firefox)

Full error: Firefox shows SEC_ERROR_UNKNOWN_ISSUER with a warning page

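What happened: Firefox maintains its own trust store separate from the operating system. Chrome and Edge have traditionally relied on the OS trust store (Keychain on macOS, the Windows certificate store, /etc/ssl on Linux), and recent Chrome versions ship their own Chrome Root Store while still honoring roots installed locally in the OS. Firefox instead bundles its own set of trusted root CAs via the Mozilla NSS library.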

Diagnostic:

If Chrome works but Firefox does not:

The certificate may use a root CA that is in the OS trust store but not in Mozilla's trust store
Or the server is missing an intermediate that Chrome fetches via AIA but Firefox does not have cached

# Check if the root is in Mozilla's list
# https://wiki.mozilla.org/CA/Included_Certificates

# Or check with openssl (simulates a non-browser client)
echo | openssl s_client -connect host:443 -servername host 2>&1 | grep "Verify return"
# If openssl also fails, the problem is the server, not Firefox

Fix: Ensure the server sends the complete chain and uses a root CA trusted by all major platforms. If using an internal CA, add it to Firefox via enterprise policy or by setting security.enterprise_roots.enabled to true in about:config:

// Firefox enterprise policy (policies.json)
{
  "policies": {
    "Certificates": {
      "ImportEnterpriseRoots": true
    }
  }
}

Part 3: The Certificate Debugging Decision Tree

When you hit a TLS error, work through this flowchart:

[Interactive flowchart: Certificate Debugging Decision Tree]

KEY CONCEPT

Work through this tree in order. Do not skip steps. The most common mistakes in certificate debugging are (1) assuming the problem is the certificate when it is actually a network issue and (2) assuming the certificate is wrong when actually the intermediate is just missing. The decision tree keeps you methodical.

Part 4: Kubernetes-Specific Certificate Issues

Kubernetes adds its own layer of certificate complexity. Here are the most common issues:

Ingress Certificate Not Loading

The TLS Secret name in the Ingress does not match the actual Secret name, or the Secret is in the wrong namespace.

# Check what Secret the Ingress references
kubectl get ingress my-ingress -o jsonpath='{.spec.tls[*].secretName}'
# my-tls-cert

# Check if that Secret exists in the SAME namespace
kubectl get secret my-tls-cert
# Error from server (NotFound): secrets "my-tls-cert" not found  ← this is your problem

# The Secret must be in the SAME namespace as the Ingress
kubectl get secret my-tls-cert -n correct-namespace
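The same comparison can be automated across many Ingresses. The sketch below is an assumption-laden illustration: it expects the Ingress as a dict parsed from kubectl get ingress ... -o json, plus the names of the Secrets in that Ingress's namespace. The function name is hypothetical.

```python
def missing_tls_secrets(ingress: dict, secrets_in_namespace) -> list:
    """Return TLS secretName values referenced by the Ingress that are not
    present among the Secret names from the Ingress's own namespace."""
    referenced = [
        tls["secretName"]
        for tls in ingress.get("spec", {}).get("tls", [])
        if "secretName" in tls
    ]
    available = set(secrets_in_namespace)
    return [name for name in referenced if name not in available]
```

An empty result means every referenced Secret exists where the Ingress controller will look for it. Anything else is the list of names to chase down.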

cert-manager Not Renewing

cert-manager automates certificate lifecycle but failures can be silent.

# Check Certificate resource status
kubectl get certificate -A
# NAME          READY   SECRET        AGE
# my-cert       False   my-tls-cert   90d    ← False = renewal failed

# Check the Certificate status conditions
kubectl describe certificate my-cert
# Conditions:
#   Type: Ready
#   Status: False
#   Reason: DoesNotExist
#   Message: Issuing certificate as Secret does not exist

# Check the Order (the ACME order for renewal)
kubectl get order -A
# Check the Challenge (the domain verification step)
kubectl get challenge -A

# Common failure: DNS solver cannot create TXT records
kubectl describe challenge my-cert-xxxxx
# State: pending
# Reason: waiting for DNS record to propagate

WAR STORY

A team had cert-manager running for two years without issues. Then they rotated their DNS provider's API credentials but forgot to update the cert-manager Secret that held the API token. Renewals failed for 60 days without anyone noticing, because cert-manager simply retries with backoff. The cert expired, and all HTTPS traffic broke. The cert-manager logs had been warning for weeks, but nobody was watching them. Set up alerts on cert-manager Certificate Ready status: if any Certificate shows Ready=False for more than 1 hour, page someone.
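The "Ready=False for more than 1 hour" rule can be sketched as a small Python check over the JSON from kubectl get certificate -A -o json. This assumes the Ready condition carries a lastTransitionTime in the usual Kubernetes UTC timestamp format; the function name stale_not_ready is made up for this example, and a real setup would more likely alert from metrics than from a script.

```python
from datetime import datetime, timedelta, timezone


def stale_not_ready(cert_list: dict, now: datetime, max_age=timedelta(hours=1)) -> list:
    """Return 'namespace/name' for each Certificate whose Ready condition has
    been False for longer than max_age."""
    stale = []
    for item in cert_list.get("items", []):
        meta = item.get("metadata", {})
        for cond in item.get("status", {}).get("conditions", []):
            if cond.get("type") != "Ready" or cond.get("status") != "False":
                continue
            # Kubernetes timestamps are RFC 3339 in UTC, e.g. 2026-03-01T00:00:00Z
            since = datetime.strptime(
                cond["lastTransitionTime"], "%Y-%m-%dT%H:%M:%SZ"
            ).replace(tzinfo=timezone.utc)
            if now - since > max_age:
                stale.append(f"{meta.get('namespace')}/{meta.get('name')}")
    return stale
```

Pass datetime.now(timezone.utc) as now. A non-empty result is the list of Certificates someone should be paged about.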

Webhook "x509: certificate signed by unknown authority"

Kubernetes admission webhooks (ValidatingWebhookConfiguration, MutatingWebhookConfiguration) use TLS between the API server and the webhook service. If the CA bundle is wrong, you get x509 errors.

# Check the webhook configuration
kubectl get validatingwebhookconfiguration my-webhook -o yaml | grep caBundle

# The caBundle must contain the base64-encoded CA certificate that signed
# the webhook service's TLS cert

# If using cert-manager, the caBundle is injected automatically via annotations:
# cert-manager.io/inject-ca-from: namespace/certificate-name

kubelet Certificate Expired

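Certificates generated by kubeadm for the control plane (API server, etcd, front-proxy, and the kubeconfig client certificates) expire after 1 year by default. The kubelet's own client certificate is normally rotated automatically by the kubelet, so it does not appear in the kubeadm output below; if kubelet rotation has failed, check /var/lib/kubelet/pki on the node.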

# Check all Kubernetes certificate expirations
kubeadm certs check-expiration

# Renew all certificates
kubeadm certs renew all

# Restart kubelet after renewal; the control plane static Pods must also be
# restarted so they pick up the renewed certificates
systemctl restart kubelet

PRO TIP

Set up monitoring for Kubernetes certificate expiry. The x509-certificate-exporter Prometheus exporter can scrape all certificate files on the filesystem and all TLS Secrets in the cluster, exposing days-until-expiry as a metric. Alert when any cert is within 30 days of expiry. This catches kubelet certs, etcd certs, webhook certs, and ingress certs in one place.

Part 5: Prevention: Never Get Woken Up for an Expired Cert

The best debugging is the debugging you never have to do. Here is how to prevent certificate issues:

Automated Renewal

Public services: Use cert-manager with Let's Encrypt ClusterIssuer for automatic issuance and renewal
Internal services: Use cert-manager with a Vault or step-ca ClusterIssuer
Non-Kubernetes: Use certbot with a cron job or systemd timer

# certbot auto-renewal (usually set up by default)
systemctl status certbot.timer
# If inactive, enable it:
systemctl enable --now certbot.timer

Monitoring and Alerting

# Install x509-certificate-exporter for Prometheus
helm repo add enix https://charts.enix.io
helm install x509-exporter \
  enix/x509-certificate-exporter \
  --set secretsExporter.enabled=true

# Prometheus alerting rule
# alert: CertificateExpiringSoon
# expr: x509_cert_not_after - time() < 86400 * 30
# labels:
#   severity: warning
# annotations:
#   summary: "Certificate {{ $labels.subject_CN }} expires in less than 30 days"

Common Certificate Errors Mapped to Fixes

Error | What you see | Fix (what you do)
Error 1 | certificate has expired | Renew cert, reload server
Error 2 | unable to verify the first certificate | Send fullchain (leaf + intermediate)
Error 3 | signed by unknown authority | Use trusted CA or add CA to trust store
Error 4 | certificate is valid for X, not Y | Get cert with correct SAN or use wildcard
Error 5 | wrong version number | Connect to HTTPS port (443), not HTTP (80)
Error 6 | self-signed certificate | Replace with CA-signed cert
Error 7 | tls: bad certificate (mTLS) | Provide valid client cert for mTLS
Error 8 | CERTIFICATE_VERIFY_FAILED (Python) | Install ca-certificates in container
Error 9 | relies on legacy Common Name | Regenerate cert with SAN extension
Error 10 | SEC_ERROR_UNKNOWN_ISSUER (Firefox) | Ensure chain is complete, use Mozilla-trusted CA

Part 6: Going Deeper

This module covered the essentials of SSL/TLS: enough to understand encryption, debug certificate errors, and keep production running. But there is much more to the TLS ecosystem:

TLS protocol internals: record layer, alert protocol, content types, key derivation functions
Kubernetes PKI: the full certificate architecture of a Kubernetes cluster (API server, kubelet, etcd, front-proxy, service account signing)
cert-manager deep dive: ClusterIssuers, Certificate resources, ACME solvers, trust-manager, policy-approver
Mutual TLS (mTLS): client certificates, service mesh identity, SPIFFE/SPIRE
Certificate Transparency: CT logs, SCTs, how browsers detect misissued certificates
OCSP and CRL: certificate revocation mechanisms and their failure modes

For the full deep dive into these topics, see the SSL/TLS and Certificate Management course.

Key Concepts Summary

Five essential OpenSSL commands (s_client, x509 -text, verify, x509 -dates, s_client -showcerts) diagnose 95% of certificate problems
The most common error is a missing intermediate certificate: the server sends only the leaf, and non-browser clients cannot verify the chain
Expired certificates are the second most common: auto-renewal (certbot, cert-manager) fails silently if credentials or DNS are misconfigured
Domain mismatch means the certificate SAN field does not contain the hostname the client connected to
Self-signed certificates provide encryption but no trust; never use them in production
Python CERTIFICATE_VERIFY_FAILED usually means the Docker container is missing the ca-certificates package
Go requires SANs since version 1.15; certificates with only CN and no SAN are rejected
Firefox has its own trust store separate from the OS, so a cert can work in Chrome but fail in Firefox
Kubernetes cert issues include mismatched Secret names, cert-manager renewal failures, webhook CA bundle problems, and kubelet cert expiry
Prevention beats debugging: use cert-manager for auto-renewal, monitor with x509-certificate-exporter, alert at 30 days before expiry

Common Mistakes

Setting verify=False or PYTHONHTTPSVERIFY=0 to "fix" certificate errors: this disables all security and masks the real problem
Not using -servername with openssl s_client and getting the wrong certificate back from multi-domain servers
Testing certificate changes only in Chrome, which masks missing intermediates via AIA fetching
Assuming auto-renewal is working because it worked last time; always monitor and alert on certificate expiry
Forgetting that Kubernetes Secrets for TLS must be in the same namespace as the Ingress that references them
Not checking cert-manager Order and Challenge resources when renewal fails; the Certificate resource just says "not ready" without details

Module 7 Complete

You now have a working understanding of SSL/TLS, from the cryptographic primitives that make HTTPS possible, through the TLS handshake protocol, to the certificate chain of trust, and the practical debugging skills to fix certificate errors at 2 AM.

This is the knowledge that separates engineers who restart nginx and hope for the best from engineers who diagnose the root cause in five minutes with openssl s_client. Keep these lessons bookmarked. You will use them.

KNOWLEDGE CHECK

Your monitoring shows that an internal API is returning TLS errors. You run openssl s_client and see 'Verify return code: 21 (unable to verify the first certificate)'. What is the most likely cause and fix?
