Alerting Done Right: From Alert Fatigue to Actionable Notifications
Too many alerts are just as bad as none at all. We show how to build an alerting system that only fires when it truly matters.
The Alert Fatigue Problem
When an on-call engineer receives 50 alerts a day and 48 of them are false positives, the two real ones get ignored too. Alert fatigue is one of the greatest risks to system reliability.
Principles for Good Alerts
- Symptom-based: Alert on symptoms (high error rate, slow response times), not causes (high CPU). CPU at 90% without impact is not an alert.
- Actionable: Every alert must have a clear action to take. If nobody can do anything about it, it is not an alert — it is a log entry.
- Severity Levels: Distinguish between P1 (wake up now) and P3 (look at it tomorrow). Not everything is a pager alert.
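The three principles above can be sketched as a small routing check. This is only an illustration: the `AlertRule` fields, the `route` function, and the destination names ("page", "ticket", "log") are hypothetical and not taken from any particular alerting tool. Only P1 and P3 are modelled because those are the levels named above.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    P1 = 1  # wake someone up now
    P3 = 3  # look at it tomorrow


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical description of an alert rule, for illustration only."""
    name: str
    symptom_based: bool  # True if the rule measures user-visible impact
    action: str          # what the responder should do; empty if nothing
    severity: Severity


def route(rule: AlertRule) -> str:
    """Decide where a firing rule should go: 'page', 'ticket', or 'log'."""
    # Cause-only signals and alerts without a clear action are log entries.
    if not rule.symptom_based or not rule.action.strip():
        return "log"
    # Only the highest severity is allowed to page the on-call engineer.
    if rule.severity is Severity.P1:
        return "page"
    return "ticket"


if __name__ == "__main__":
    rules = [
        AlertRule("high_error_rate", True, "Roll back the last deploy", Severity.P1),
        AlertRule("slow_responses", True, "Check cache hit rate", Severity.P3),
        AlertRule("cpu_at_90_percent", False, "", Severity.P1),
    ]
    for rule in rules:
        print(f"{rule.name}: {route(rule)}")
```

The point of the sketch is that severity alone never triggers a page. A rule first has to pass the symptom and actionability checks.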
SLO-Based Alerting
A more robust approach is to define Service Level Objectives (SLOs) and alert when the error budget is being consumed too quickly.
- Error Budget: With an SLO of 99.9%, you have approximately 43 minutes of downtime budget per 30-day month (0.1% of 43,200 minutes).
- Burn Rate: Alert when the budget is being consumed faster than expected — not on every individual error.
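- Multi-Window: Require the burn rate to exceed the threshold in both a fast (5 min) and a slow (1 h) window. The slow window filters out short blips and the fast window lets the alert clear quickly after recovery, which drastically reduces false positives.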
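The arithmetic behind these three points can be sketched in a few lines. The function names are hypothetical. The default threshold of 14.4 is an assumption: it corresponds to spending 2% of a 30-day budget within one hour, a value often cited for paging alerts. Choose a threshold that matches your own SLO policy.

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over `days` days."""
    return (1.0 - slo) * days * 24 * 60


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring.

    A burn rate of 1.0 uses up the budget exactly at the end of the SLO period.
    """
    return error_ratio / (1.0 - slo)


def should_page(fast_error_ratio: float, slow_error_ratio: float,
                slo: float, threshold: float = 14.4) -> bool:
    """Multi-window check: page only if BOTH windows exceed the threshold.

    fast_error_ratio: fraction of failed requests in the fast window (5 min)
    slow_error_ratio: fraction of failed requests in the slow window (1 h)
    threshold: assumed burn-rate limit, not a universal constant
    """
    return (burn_rate(fast_error_ratio, slo) >= threshold
            and burn_rate(slow_error_ratio, slo) >= threshold)


if __name__ == "__main__":
    slo = 0.999
    print(f"Monthly budget: {error_budget_minutes(slo):.1f} minutes")
    # A short spike: the fast window is burning hot, the slow window is not.
    print("Spike only:", should_page(0.02, 0.001, slo))
    # A sustained problem: both windows are burning hot.
    print("Sustained:", should_page(0.02, 0.02, slo))
```

With a 99.9% SLO, a 2% error ratio is a burn rate of about 20. The sketch pages only when that rate shows up in both windows, so a brief spike that never reaches the one-hour window stays silent.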
Practical Tips
Conduct regular alert reviews. Delete alerts that nobody responds to, and document a runbook for every alert that remains. Finally, respect the on-call rotation: engineers who have not slept make mistakes.
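An alert review can be partly automated. The sketch below assumes you can export, per alert, how often it fired, how often someone acted on it, and whether a runbook exists. The `AlertStats` record and the `review` function are hypothetical and not tied to any specific tool.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class AlertStats:
    """Hypothetical per-alert history for one review period."""
    name: str
    times_fired: int
    times_acted_on: int
    has_runbook: bool


def review(stats: List[AlertStats]) -> Dict[str, List[str]]:
    """Group alerts into deletion candidates and alerts missing a runbook."""
    findings: Dict[str, List[str]] = {"delete": [], "needs_runbook": []}
    for s in stats:
        if s.times_fired > 0 and s.times_acted_on == 0:
            # Fired, but nobody ever responded: candidate for deletion.
            findings["delete"].append(s.name)
        elif not s.has_runbook:
            # Worth keeping, but responders have no documented procedure.
            findings["needs_runbook"].append(s.name)
    return findings


if __name__ == "__main__":
    history = [
        AlertStats("disk_almost_full", 12, 0, True),
        AlertStats("checkout_error_rate", 3, 3, False),
        AlertStats("login_latency", 2, 2, True),
    ]
    print(review(history))
```

The output is a starting point for the review meeting, not a decision. An alert that fired without any response may also point to a routing problem rather than a useless alert.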