SSL for DevOps: Monitoring Certificates in CI/CD Pipelines – CertNotify

SSL certificate expiry is a leading cause of production incidents — and one of the most preventable. For DevOps and SRE teams, certificate management should be treated as infrastructure: automated, monitored, and included in runbooks. This guide covers integrating certificate health into your existing toolchain, from CI/CD pipelines to Prometheus alerting and on-call playbooks.

Why SSL Expires Despite Automation

Teams assume automation solves the problem. It often does not, because:

•Certbot renewal jobs fail silently — a cron error goes unnoticed until the certificate expires

•DNS-01 challenge tokens expire before the renewal runs if DNS TTL is too high

•Cloud provider certificate managers fail when IAM permissions change

•Certificates are manually uploaded and overwrite auto-managed ones

•Load balancers are misconfigured to serve old cached certificates after renewal

•Certificates exist on services that no longer receive automated renewal (legacy apps, internal APIs)

Method 1: SSL Checks in CI/CD Pipelines

Add a certificate expiry check as a step in your deployment pipeline. This catches certificates that are about to expire before you deploy new code and claim the release is healthy.

# Bash script — check SSL expiry (fails CI if < 14 days remaining)

#!/bin/bash
DOMAIN="yourdomain.com"
WARN_DAYS=14

EXPIRY=$(echo | openssl s_client -servername "$DOMAIN" -connect "$DOMAIN:443" 2>/dev/null \
  | openssl x509 -noout -enddate 2>/dev/null \
  | cut -d= -f2)

EXPIRY_EPOCH=$(date -d "$EXPIRY" +%s 2>/dev/null || date -j -f "%b %e %H:%M:%S %Y %Z" "$EXPIRY" +%s)
NOW_EPOCH=$(date +%s)
DAYS_REMAINING=$(( (EXPIRY_EPOCH - NOW_EPOCH) / 86400 ))

echo "Certificate expires in $DAYS_REMAINING days ($EXPIRY)"

if [ "$DAYS_REMAINING" -lt "$WARN_DAYS" ]; then
  echo "ERROR: Certificate expires in less than $WARN_DAYS days!"
  exit 1
fi
echo "OK: Certificate is valid for $DAYS_REMAINING more days"

# GitHub Actions step

- name: Check SSL certificate expiry
  run: |
    ./scripts/check-ssl.sh
  env:
    DOMAIN: ${{ vars.PRODUCTION_DOMAIN }}

Method 2: Prometheus + Alertmanager

The blackbox_exporter is the standard Prometheus tool for probing endpoints and exposing certificate metrics.

# blackbox_exporter config

modules:
  https_2xx:
    prober: http
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      tls_config:
        insecure_skip_verify: false

# Prometheus alert rules

groups:
- name: ssl_certificates
  rules:
  - alert: SSLCertExpiringSoon
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "SSL certificate expiring soon"
      description: "{{ $labels.instance }} cert expires in {{ $value | humanizeDuration }}"

  - alert: SSLCertExpiringCritical
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 7
    for: 1h
    labels:
      severity: critical
    annotations:
      summary: "SSL certificate expiring CRITICAL"
      description: "{{ $labels.instance }} cert expires in {{ $value | humanizeDuration }}"

  - alert: SSLCertExpired
    expr: probe_ssl_earliest_cert_expiry - time() <= 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "SSL certificate has EXPIRED"
      description: "{{ $labels.instance }} SSL certificate is expired!"

Method 3: Kubernetes — cert-manager Health Checks

# Check all certificates across namespaces
kubectl get certificates --all-namespaces

# Watch for not-ready certificates
kubectl get certificates -A | grep -v "True"

# Inspect a specific certificate
kubectl describe certificate my-cert -n production

# Check events for renewal failures
kubectl get events -n production --field-selector reason=Failed

# Useful for alerting: cert-manager exports Prometheus metrics at :9402/metrics
# Key metric: certmanager_certificate_expiration_timestamp_seconds

Method 4: Infrastructure-as-Code (Terraform)

Track certificate expiry as Terraform data sources to catch drift between your IaC definition and the live certificate:

# Terraform — use tls_certificate data source to read live cert details
data "tls_certificate" "production" {
  url = "https://yourdomain.com"
}

output "cert_expiry" {
  value = data.tls_certificate.production.certificates[0].not_after
}

output "cert_issuer" {
  value = data.tls_certificate.production.certificates[0].issuer
}

# Fail plan if cert expires within 30 days
locals {
  expiry_date = timeadd(data.tls_certificate.production.certificates[0].not_after, "0s")
  days_remaining = ceil((parseint(local.expiry_date, 10) - parseint(timestamp(), 10)) / 86400)
}

Incident Response Runbook: Expired SSL Certificate

Confirm the incident

Run: openssl s_client -connect yourdomain.com:443 < /dev/null 2>&1 | openssl x509 -noout -dates. Confirm the certificate is expired or about to expire.

Identify the renewal method

Is this managed by Certbot? cert-manager? Cloud ACM? Find the renewal configuration immediately.

Emergency manual renewal (Certbot)

sudo certbot renew --force-renewal -d yourdomain.com then sudo systemctl reload nginx

Communicate status

Post to status page. Notify customer-facing teams. Internal SLA: resolve within 30 minutes of detection.

Post-incident review

Identify why monitoring did not catch this 14+ days in advance. Add or fix the monitoring gap before closing the ticket.

External monitoring as your safety net

Internal Prometheus and CI checks are great. But external monitoring from outside your network catches issues that internal tools miss: CDN-layer certificates, load balancer misconfigurations, and geo-specific failures. CertNotify provides this external check as a safety net — free for up to 3 domains.

Add external monitoring →