Kill the flies before fighting fires
Kill the flies. Then you’ll finally have time for the fires.
Hot take: most teams don’t have an “incident response” problem. They have a noise economy problem. We celebrate the Friday-night P0 save and ignore the 300 pages that ate someone’s entire week. Half of those “real” alerts auto-resolve. That’s not resilience, it’s Stockholm syndrome.
If your on-call spends the week acknowledging PagerDuty pages, you’re not improving MTTR. You’re burning the team’s mean thinking time.
What I keep seeing:
- Coverage is “good.” Volume is brutal.
- Alerts added after postmortems stay forever, even if they’re low-value, because… “what if it happens again?”
- The distribution is a triangle: a few P0s/P1s, a sea of P3s.
- P3s get ignored on busy weeks, so they metastasize into next week’s P2s.
- Root-causing is tough: categories are fuzzy, runbooks are half-baked, and by the time an engineer looks, the system has healed.
The counterintuitive part: the P0s are not the real enemy. They’re where engineers actually learn the system (albeit a bit stressfully!).
The enemy is the swarm that keeps you from ever getting there.
Seriously, take the time to kill the flies.
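If you want somewhere to start, here is a rough sketch of how you might find the flies in a week of page history. It is an illustration, not a tool: the `Page` record, the rule names, the sample data, and the thresholds (`min_count`, `min_auto_share`) are all made up for this example, and a real export from your paging system will look different.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    rule: str            # hypothetical alert rule name
    priority: str        # "P0" through "P3"
    auto_resolved: bool  # True if it cleared with no human action


def find_flies(pages, min_count=3, min_auto_share=0.5):
    """Return (rule, count, auto_share) for noisy rules, loudest first.

    A rule counts as a "fly" here if it paged at least min_count times
    and at least min_auto_share of those pages resolved on their own.
    Both thresholds are arbitrary starting points, not recommendations.
    """
    total = Counter(p.rule for p in pages)
    auto = Counter(p.rule for p in pages if p.auto_resolved)
    flies = [
        (rule, n, auto[rule] / n)
        for rule, n in total.items()
        if n >= min_count and auto[rule] / n >= min_auto_share
    ]
    return sorted(flies, key=lambda f: f[1], reverse=True)


# Made-up sample week, purely illustrative.
week = (
    [Page("disk-usage-80pct", "P3", True)] * 40
    + [Page("queue-lag-warning", "P3", True)] * 25
    + [Page("queue-lag-warning", "P3", False)] * 5
    + [Page("checkout-5xx", "P0", False)] * 2
)

for rule, count, share in find_flies(week):
    print(f"{rule}: {count} pages, {share:.0%} auto-resolved")
```

On this sample, the two P3 rules get flagged and the P0 does not, which is the point: the list that comes out is the backlog of alerts to tune, downgrade, or delete.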