Kill the flies before fighting fires

10 Sep 2025

Kill the flies. Then you’ll finally have time for the fires.

Hot take: most teams don’t have an “incident response” problem. They have a noise economy problem. We celebrate the Friday-night P0 save and ignore the 300 pages that ate someone’s entire week. Half of those “real” alerts auto-resolve. That’s not resilience, it’s Stockholm syndrome.

If your on-call spends the week acknowledging PagerDuty pages, you’re not improving MTTR; you’re burning the team’s mean thinking time.

What I keep seeing:

Coverage is “good.” Volume is brutal.
Alerts added after postmortems stay forever, even if they’re low-value, because… “what if it happens again?”
The distribution is a triangle: a few P0s/P1s, a sea of P3s.
P3s get ignored on busy weeks, so they metastasize into next week’s P2s.
Root-causing is tough: categories are fuzzy, runbooks are half-baked, and by the time an engineer looks, the system has healed.
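
If you want to put numbers on that, here is a minimal sketch of how I'd count the flies. It assumes you can export pages from your paging tool into a simple record; the Page fields, the function names, and the thresholds (5 pages, 50% auto-resolve) are all made up for illustration, not any tool's real schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    # Hypothetical record shape; map your paging tool's export onto it.
    alert: str           # alert rule name
    priority: str        # e.g. "P0" through "P3"
    auto_resolved: bool  # cleared before a human did anything


def fly_report(pages, min_count=5, min_auto_rate=0.5):
    """Return (alert, count, auto_resolve_rate) for noisy alerts, loudest first.

    The thresholds are arbitrary starting points, not recommendations.
    """
    totals = Counter(p.alert for p in pages)
    auto = Counter(p.alert for p in pages if p.auto_resolved)
    flies = [
        (name, count, auto[name] / count)
        for name, count in totals.items()
        if count >= min_count and auto[name] / count >= min_auto_rate
    ]
    return sorted(flies, key=lambda row: row[1], reverse=True)


def priority_histogram(pages):
    """Count pages per priority to see the shape of the triangle."""
    return dict(sorted(Counter(p.priority for p in pages).items()))
```

Run it over a week of pages: the top of fly_report is the list of alerts to tune, demote, or delete, and priority_histogram shows how wide the bottom of the triangle really is.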

The counterintuitive part: the P0s are not the real enemy. They’re where engineers actually learn the system (albeit a bit stressfully!).

The enemy is the swarm that keeps you from ever getting there.

Seriously, take the time to kill the flies.
