Prometheus Alerting That Does Not Burn You Out

Author

Guruprasad Raikar

Platform Engineer building developer tooling, CI/CD systems, and cloud infrastructure that help teams ship faster and more reliably.

On-call burnout is mostly an alerting design problem. When every spike in CPU, every 5xx, and every pod restart sends a page at 2am, engineers learn to ignore alerts. Then the important ones get ignored too.

Alert on symptoms, not causes

A high CPU alert tells you something might be wrong. A high error rate alert tells you users are being affected right now. Page on the second. Investigate the first only when it correlates with user impact.
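A cause-based signal like CPU can still be captured as an alert, just not one that pages. The rule below is a rough sketch of that split. It assumes node_exporter's node_cpu_seconds_total metric and a hypothetical severity: ticket label that your Alertmanager routing sends to a queue instead of a pager. The 90% threshold and the 30-minute for are placeholders, not recommendations.

```yaml
groups:
  - name: node_causes
    rules:
      - alert: HighCpuUsage
        # Fraction of non-idle CPU time per instance over the last 5 minutes.
        expr: |
          1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9
        for: 30m
        labels:
          severity: ticket   # hypothetical non-paging severity
        annotations:
          summary: "{{ $labels.instance }} CPU above 90% for 30 minutes"
```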

The four golden signals

  • Latency — Alert when p99 exceeds your SLO threshold, not the average.
  • Errors — Alert when the error rate exceeds a percentage of total traffic.
  • Traffic — Use as context, not as an alert trigger by itself.
  • Saturation — Alert when saturation is on a trajectory to cause impact.
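Here is roughly what the latency and saturation signals could look like as rules. This is a sketch under assumptions: http_request_duration_seconds_bucket is a hypothetical histogram exposed by your service, node_filesystem_avail_bytes comes from node_exporter, and the 0.5s threshold, 6h lookback, and 4h horizon are placeholders to replace with your own SLO numbers.

```yaml
groups:
  - name: golden_signals
    rules:
      - alert: HighLatencyP99
        # p99 estimated from histogram buckets; 0.5s stands in for your SLO threshold.
        expr: |
          histogram_quantile(
            0.99,
            sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 0.5
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.service }} p99 latency {{ $value | humanizeDuration }}"
      - alert: DiskFullIn4Hours
        # Linear extrapolation of the last 6h of free space, 4h into the future.
        expr: |
          predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[6h], 4 * 3600) < 0
        for: 30m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.instance }} {{ $labels.mountpoint }} predicted to fill within 4 hours"
```

The saturation rule fires on the trend rather than on a fixed percentage, so a disk that sits at 85% for months stays quiet while one filling quickly gets attention early.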

Writing good alert rules

groups:
  - name: api_slo
    rules:
      - alert: HighErrorRate
        expr: |
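          # Aggregate before dividing. Without sum(), each 5xx series only
          # matches itself on the right-hand side, so the ratio is always 1.
          (
            sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum by (service) (rate(http_requests_total[5m]))
          ) > 0.01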
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.service }} error rate {{ $value | humanizePercentage }}"
          runbook: "https://wiki.internal/runbooks/high-error-rate"

The for clause is your friend

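Always set a for duration. Without it, the alert fires the first time the rule evaluates to true, so a single bad scrape can page someone. Most transient spikes resolve themselves. With for: 5m, the condition has to hold continuously for five minutes before the alert fires, which filters out anything shorter. The cost is that real incidents page about five minutes later, which is usually an acceptable trade.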
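You can check this behavior before a rule ever reaches production with promtool's rule unit tests. The file below is a minimal sketch. It assumes the HighErrorRate rule from this post is saved as alerts.yml (a hypothetical filename) next to the test file, and the traffic numbers are made up. It feeds in one short burst of errors and expects the alert never to reach the firing state.

```yaml
# tests.yml, run with: promtool test rules tests.yml
rule_files:
  - alerts.yml   # assumed file containing the HighErrorRate rule

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Steady successful traffic: 100 requests per minute.
      - series: 'http_requests_total{service="api", status="200"}'
        values: '0+100x10'
      # One burst of 50 errors between minute 2 and minute 3, then nothing.
      - series: 'http_requests_total{service="api", status="500"}'
        values: '0+0x2 50+0x7'
    alert_rule_test:
      # Mid-spike: the condition is true but has not held for 5m, so the alert is only pending.
      - eval_time: 5m
        alertname: HighErrorRate
        exp_alerts: []
      # After the burst leaves the 5m rate window, the alert resolves without ever firing.
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts: []
```

Only firing alerts count toward exp_alerts, so a pending alert at the 5-minute mark still passes as an empty list.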

Checklist for every new alert

  • Does this alert mean a user is being impacted right now?
  • Is there a runbook link in the annotations?
  • Is the for duration long enough to filter transient spikes?
  • Is the threshold based on actual SLO data, not gut feel?
