On-call burnout is mostly an alerting design problem. When every spike in CPU, every 5xx, and every pod restart sends a page at 2am, engineers learn to ignore alerts. Then the important ones get ignored too.
Alert on symptoms, not causes
A high CPU alert tells you something might be wrong. A high error rate alert tells you users are being affected right now. Page on the second. Investigate the first only when it correlates with user impact.
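One way to encode this split is to keep the cause-based alert but route it somewhere that does not page anyone. The rule below is a minimal sketch, not a recommended configuration. It assumes node_exporter's `node_cpu_seconds_total` metric and a hypothetical `severity: ticket` label that your Alertmanager routes to a queue instead of a pager. The 90% threshold and 15-minute duration are placeholders.

```yaml
groups:
  - name: cause_signals
    rules:
      # Cause-based alert: useful context, but not proof of user impact.
      - alert: HighCpuUsage
        expr: |
          1 - avg by (instance) (
            rate(node_cpu_seconds_total{mode="idle"}[5m])
          ) > 0.9
        for: 15m
        labels:
          severity: ticket  # assumption: routed to a ticket queue, never a pager
        annotations:
          summary: "{{ $labels.instance }} CPU above 90% for 15 minutes"
```

With this in place, the CPU signal is still recorded for whoever investigates an error-rate page, without waking anyone up by itself.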
The four golden signals
- Latency: Alert when p99 latency exceeds your SLO threshold. Do not alert on the average, which hides the slow tail.
- Errors: Alert when errors exceed a set percentage of total traffic, not an absolute count.
- Traffic: Use as context, not as an alert trigger on its own.
- Saturation: Alert when a resource is on a trajectory to cause user impact, for example a disk projected to fill within hours.
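As a rough illustration of the latency and saturation signals, the two rules below sketch what such alerts might look like in Prometheus. The histogram name `http_request_duration_seconds_bucket`, the `service` label, and the 0.5-second threshold are assumptions for the example. `node_filesystem_avail_bytes` comes from node_exporter, and the 6-hour lookback and 4-hour horizon are placeholders to tune against your own data.

```yaml
groups:
  - name: golden_signals
    rules:
      # Latency: p99 computed from a histogram, compared against an assumed 500 ms SLO.
      - alert: HighP99Latency
        expr: |
          histogram_quantile(
            0.99,
            sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 0.5
        for: 5m
        labels:
          severity: page
      # Saturation: the trend over the last 6h predicts no free space 4 hours from now.
      - alert: DiskFillingUp
        expr: |
          predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[6h], 4 * 3600) < 0
        for: 30m
        labels:
          severity: page
```

The saturation rule fires on the trajectory rather than on a fixed "disk 90% full" threshold, so a slowly growing volume stays quiet while a rapidly filling one pages before it runs out.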
Writing good alert rules
groups:
  - name: api_slo
    rules:
      - alert: HighErrorRate
        # Aggregate both sides by service. Without this, each 5xx series is
        # divided by itself and the ratio is always 1.
        expr: |
          (
            sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum by (service) (rate(http_requests_total[5m]))
          ) > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.service }} error rate {{ $value | humanizePercentage }}"
          runbook: "https://wiki.internal/runbooks/high-error-rate"
The `for` clause is your friend
Always set a `for` duration. Without it, the alert fires the first time the expression evaluates to true, so a single bad evaluation can page someone. Most transient spikes resolve on their own. A 5-minute `for` filters most of them out, at the cost of delaying the page for a real incident by those same five minutes.
Checklist for every new alert
- Does this alert mean a user is being impacted right now?
- Is there a runbook link in the annotations?
- Is the `for` duration long enough to filter transient spikes?
- Is the threshold based on actual SLO data, not gut feel?