Engineering teams' informal failure banks

Every engineering team has an informal “failure bank” distributed across different engineers’ memories. Engineer A knows all the edge cases of different configs. Engineer B has fought many battles with Kafka rebalancing. Engineer C knows all the quirks of Kubernetes autoscaling. This uneven distribution of troubleshooting knowledge is both a strength and a vulnerability.

Last week, an engineer told me how he saved his team from a multi-hour outage. Kubernetes pods were stuck in ‘Pending’ with no error messages. Dashboards showed normal resource utilization and a growing backlog of pending workloads. His team had no idea what was going on. He jumped in and noticed that the pods were waiting for an IP address, an issue he’d seen before. Crisis averted.

The failure bank gets built incident by incident. The first time an engineer encounters an issue, they struggle through hours of investigation, correlating metrics and logs across systems to find the root cause. Once they solve it, they become the informal expert on that particular failure mode. This creates a powerful feedback loop where the team instinctively pulls in “the Kafka person” or “the networking guru” when similar symptoms appear. Over time, this specialization deepens as engineers build pattern recognition for their areas of battle-tested expertise.

I don’t think I’ve seen an effective way to ‘force’ documentation. Tools like PagerDuty hardly get updated, and runbooks/playbooks even less so. The two patterns I’ve seen work best are:

After seeing these patterns repeat across teams, I think AI agents have real potential here. They could become the unified failure bank that teams need — observing incidents across systems, learning from each investigation, and making that collective knowledge searchable. The hard part isn’t the technology. It’s replicating the pattern recognition that experienced engineers build over years of firefighting.
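
To make "searchable collective knowledge" a bit more concrete, here is a minimal sketch of what a shared failure bank could look like as a data structure. Everything in it is an assumption for illustration: the names FailureRecord, FailureBank, and search are hypothetical, the ranking is plain word overlap rather than anything an AI agent would actually do, and the "fix" text in the example is made up, since the story above doesn't say how the IP issue was resolved.

```python
from dataclasses import dataclass, field


def _tokens(text: str) -> set[str]:
    """Lowercase the text and split it into a set of words, dropping punctuation."""
    return {w.strip(".,:;'\"").lower() for w in text.split()} - {""}


@dataclass
class FailureRecord:
    """One past incident: what was observed, what caused it, how it was fixed."""
    symptoms: str
    root_cause: str
    fix: str
    owner: str = "unknown"


@dataclass
class FailureBank:
    """Hypothetical shared store of past incidents, searchable by symptoms."""
    records: list[FailureRecord] = field(default_factory=list)

    def add(self, record: FailureRecord) -> None:
        self.records.append(record)

    def search(self, observed: str, top_k: int = 3) -> list[FailureRecord]:
        """Rank records by how many words they share with the observed symptoms."""
        query = _tokens(observed)
        scored = [(len(query & _tokens(r.symptoms)), r) for r in self.records]
        # Keep only records with at least one word in common.
        scored = [(score, r) for score, r in scored if score > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [r for _, r in scored[:top_k]]


bank = FailureBank()
bank.add(FailureRecord(
    symptoms="pods stuck in Pending, no error messages, normal resource utilization",
    root_cause="pods waiting for an IP address",
    fix="(hypothetical) free up or expand the pod IP range",
    owner="the networking guru",
))
bank.add(FailureRecord(
    symptoms="consumer lag grows during Kafka rebalancing",
    root_cause="(hypothetical) frequent consumer group rebalances",
    fix="(hypothetical) tune consumer session settings",
    owner="the Kafka person",
))

for match in bank.search("pods Pending with no error messages"):
    print(match.owner, "->", match.root_cause)
```

Word overlap is a deliberately crude stand-in for the pattern recognition described above. It only shows the shape of the record (symptoms, root cause, fix, owner), which is the part that usually lives in someone's head.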
