Prometheus Alerting That Does Not Burn You Out

11 April 2026

Author

Guruprasad Raikar

Platform Engineer building developer tooling, CI/CD systems, and cloud infrastructure that help teams ship faster and more reliably.

On-call burnout is mostly an alerting design problem. When every spike in CPU, every 5xx, and every pod restart sends a page at 2am, engineers learn to ignore alerts. Then the important ones get ignored too.

Alert on symptoms, not causes

A high CPU alert tells you something might be wrong. A high error rate alert tells you users are being affected right now. Page on the second. Investigate the first only when it correlates with user impact.
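As a rough sketch of that split, a cause-level signal such as CPU can still be recorded as an alert, just not one that pages. The rule below assumes node_exporter's `node_cpu_seconds_total` metric is being scraped; the group name, the 90% threshold, the 15-minute window, and the `severity: ticket` value are illustrative assumptions, not recommendations from a particular setup.

```yaml
groups:
  - name: api_causes            # hypothetical group name
    rules:
      # Cause-level signal: useful context during an incident, not a page.
      - alert: HighCpuUsage
        expr: |
          avg by (instance) (
            1 - rate(node_cpu_seconds_total{mode="idle"}[5m])
          ) > 0.9
        for: 15m
        labels:
          severity: ticket      # assumed non-paging severity
        annotations:
          summary: "{{ $labels.instance }} CPU above 90% for 15 minutes"
```

Whether `ticket` lands in a queue, a chat channel, or a dashboard depends on how Alertmanager routes that label, which is outside this sketch.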

The four golden signals

Latency — Alert when p99 exceeds your SLO threshold, not the average.
Errors — Alert when the error rate exceeds a percentage of total traffic.
Traffic — Use as context, not as an alert trigger by itself.
Saturation — Alert when saturation is on a trajectory to cause impact.
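The latency and saturation signals above could be sketched as rules like the following. The histogram name `http_request_duration_seconds_bucket`, the 0.5 second threshold, and the 6 hour and 4 hour windows are assumptions for illustration; `node_filesystem_avail_bytes` is the node_exporter metric for free disk space. Substitute your own metrics and SLO values.

```yaml
groups:
  - name: golden_signals        # hypothetical group name
    rules:
      # Latency: p99 over the last 5 minutes against an assumed 500 ms SLO.
      - alert: HighP99Latency
        expr: |
          histogram_quantile(
            0.99,
            sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 0.5
        for: 5m
        labels:
          severity: page
      # Saturation: extrapolate the last 6 hours of free disk space linearly
      # and alert if the projection drops below zero within 4 hours.
      - alert: DiskWillFillIn4Hours
        expr: predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[6h], 4 * 3600) < 0
        for: 30m
        labels:
          severity: page
```

The saturation rule alerts on the trend rather than on a fixed "disk 90% full" threshold, so a slowly filling large disk stays quiet while a fast-filling one is caught early.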

Writing good alert rules

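groups:
  - name: api_slo
    rules:
      - alert: HighErrorRate
        # Aggregate both sides by service before dividing. Without sum by,
        # the division matches series label-for-label, so each 5xx series
        # is divided by itself and the ratio is always 1.
        expr: |
          (
            sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum by (service) (rate(http_requests_total[5m]))
          ) > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.service }} error rate {{ $value | humanizePercentage }}"
          runbook: "https://wiki.internal/runbooks/high-error-rate"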

The `for` clause is your friend

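Always set a `for` duration. Without it, the alert fires the first time a rule evaluation finds the expression true, so a single bad value can page someone. Most transient spikes resolve themselves. With `for: 5m`, the alert stays pending until the expression has been true at every evaluation for five minutes, which filters out most short-lived spikes. The cost is that a real incident pages about five minutes later, which is usually an acceptable trade.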
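To make the pending and firing states concrete, here is a minimal rule annotated with an example timeline. It assumes a 1-minute evaluation interval and a hypothetical recording rule named `service:http_errors:ratio_rate5m` that holds the error ratio; the clock times are invented for illustration.

```yaml
groups:
  - name: for_clause_demo       # hypothetical group name
    rules:
      - alert: HighErrorRate
        # Hypothetical recording rule holding the 5xx ratio per service.
        expr: service:http_errors:ratio_rate5m > 0.01
        for: 5m
        # Example timeline, assuming evaluation_interval: 1m
        #   02:00  expr true   -> alert becomes pending, nothing is sent
        #   02:02  expr true   -> still pending
        #   02:03  expr false  -> pending alert is dropped, timer resets
        #   02:10  expr true   -> pending again, timer starts from 02:10
        #   02:15  expr true at every evaluation since 02:10
        #                      -> alert is firing, Alertmanager is notified
        labels:
          severity: page
```

The blip from 02:00 to 02:03 never leaves the pending state, so nobody is paged for it.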

Checklist for every new alert

Does this alert mean a user is being impacted right now?
Is there a runbook link in the annotations?
Is the for duration long enough to filter transient spikes?
Is the threshold based on actual SLO data, not gut feel?
