SSL Certificate Monitoring for DevOps & SRE Teams
SSL certificate expiry is a leading cause of production incidents — and one of the most preventable. For DevOps and SRE teams, certificate management should be treated as infrastructure: automated, monitored, and included in runbooks. This guide covers integrating certificate health into your existing toolchain, from CI/CD pipelines to Prometheus alerting and on-call playbooks.
Why SSL Expires Despite Automation
Teams assume automation solves the problem. It often does not, because:
Method 1: SSL Checks in CI/CD Pipelines
Add a certificate expiry check as a step in your deployment pipeline. This catches certificates that are about to expire before you deploy new code and claim the release is healthy.
# Bash script — check SSL expiry (fails CI if < 14 days remaining)
#!/bin/bash DOMAIN="yourdomain.com" WARN_DAYS=14 EXPIRY=$(echo | openssl s_client -servername "$DOMAIN" -connect "$DOMAIN:443" 2>/dev/null \ | openssl x509 -noout -enddate 2>/dev/null \ | cut -d= -f2) EXPIRY_EPOCH=$(date -d "$EXPIRY" +%s 2>/dev/null || date -j -f "%b %e %H:%M:%S %Y %Z" "$EXPIRY" +%s) NOW_EPOCH=$(date +%s) DAYS_REMAINING=$(( (EXPIRY_EPOCH - NOW_EPOCH) / 86400 )) echo "Certificate expires in $DAYS_REMAINING days ($EXPIRY)" if [ "$DAYS_REMAINING" -lt "$WARN_DAYS" ]; then echo "ERROR: Certificate expires in less than $WARN_DAYS days!" exit 1 fi echo "OK: Certificate is valid for $DAYS_REMAINING more days"
# GitHub Actions step
- name: Check SSL certificate expiry
run: |
./scripts/check-ssl.sh
env:
DOMAIN: ${{ vars.PRODUCTION_DOMAIN }}Method 2: Prometheus + Alertmanager
The blackbox_exporter is the standard Prometheus tool for probing endpoints and exposing certificate metrics.
# blackbox_exporter config
modules:
https_2xx:
prober: http
http:
valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
tls_config:
insecure_skip_verify: false# Prometheus alert rules
groups:
- name: ssl_certificates
rules:
- alert: SSLCertExpiringSoon
expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
for: 1h
labels:
severity: warning
annotations:
summary: "SSL certificate expiring soon"
description: "{{ $labels.instance }} cert expires in {{ $value | humanizeDuration }}"
- alert: SSLCertExpiringCritical
expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 7
for: 1h
labels:
severity: critical
annotations:
summary: "SSL certificate expiring CRITICAL"
description: "{{ $labels.instance }} cert expires in {{ $value | humanizeDuration }}"
- alert: SSLCertExpired
expr: probe_ssl_earliest_cert_expiry - time() <= 0
for: 5m
labels:
severity: critical
annotations:
summary: "SSL certificate has EXPIRED"
description: "{{ $labels.instance }} SSL certificate is expired!"Method 3: Kubernetes — cert-manager Health Checks
# Check all certificates across namespaces kubectl get certificates --all-namespaces # Watch for not-ready certificates kubectl get certificates -A | grep -v "True" # Inspect a specific certificate kubectl describe certificate my-cert -n production # Check events for renewal failures kubectl get events -n production --field-selector reason=Failed # Useful for alerting: cert-manager exports Prometheus metrics at :9402/metrics # Key metric: certmanager_certificate_expiration_timestamp_seconds
Method 4: Infrastructure-as-Code (Terraform)
Track certificate expiry as Terraform data sources to catch drift between your IaC definition and the live certificate:
# Terraform — use tls_certificate data source to read live cert details
data "tls_certificate" "production" {
url = "https://yourdomain.com"
}
output "cert_expiry" {
value = data.tls_certificate.production.certificates[0].not_after
}
output "cert_issuer" {
value = data.tls_certificate.production.certificates[0].issuer
}
# Fail plan if cert expires within 30 days
locals {
expiry_date = timeadd(data.tls_certificate.production.certificates[0].not_after, "0s")
days_remaining = ceil((parseint(local.expiry_date, 10) - parseint(timestamp(), 10)) / 86400)
}Incident Response Runbook: Expired SSL Certificate
Confirm the incident
Run: openssl s_client -connect yourdomain.com:443 < /dev/null 2>&1 | openssl x509 -noout -dates. Confirm the certificate is expired or about to expire.
Identify the renewal method
Is this managed by Certbot? cert-manager? Cloud ACM? Find the renewal configuration immediately.
Emergency manual renewal (Certbot)
sudo certbot renew --force-renewal -d yourdomain.com then sudo systemctl reload nginx
Communicate status
Post to status page. Notify customer-facing teams. Internal SLA: resolve within 30 minutes of detection.
Post-incident review
Identify why monitoring did not catch this 14+ days in advance. Add or fix the monitoring gap before closing the ticket.
External monitoring as your safety net
Internal Prometheus and CI checks are great. But external monitoring from outside your network catches issues that internal tools miss: CDN-layer certificates, load balancer misconfigurations, and geo-specific failures. CertNotify provides this external check as a safety net — free for up to 3 domains.
Add external monitoring →