Engineering teams' informal failure banks
Every engineering team has an informal “failure bank” distributed across different engineers’ memories. Engineer A knows all the edge cases of different configs. Engineer B has fought many battles with Kafka rebalancing. Engineer C knows all the quirks with Kubernetes autoscaling. This uneven distribution of troubleshooting effectiveness is both a strength and a vulnerability.
Last week an engineer told me how he saved his team from a multi-hour outage. Kubernetes pods were stuck in ‘Pending’ with no error messages. Dashboards showed normal resource utilization and a growing backlog of pending workloads. His team had no idea what was going on. He jumped in and noticed that the pods were waiting for IP addresses, an issue he’d seen before. Crisis averted.
The failure bank gets built incident by incident. The first time an engineer encounters an issue, they struggle through hours of investigation, correlating metrics and logs across systems to find the root cause. Once they solve it, they become the informal expert on that particular failure mode. This creates a powerful feedback loop where the team instinctively pulls in “the Kafka person” or “the networking guru” when similar symptoms appear. Over time, this specialization deepens as engineers build pattern recognition for their areas of battle-tested expertise.
I don’t think I’ve seen an effective way to ‘force’ documentation. Tools like PagerDuty hardly get updated, and runbooks/playbooks even less so. The two patterns I’ve seen work best are:
- Organizing a regular retro-style meeting, where engineers present a recent postmortem to others for feedback. These docs are stored in a repository that others can search.
- Relying on organic conversation; Slack is great for this. Ensure people discuss common alerts in public channels, so it’s easy for others to search and find lessons learned.
After seeing these patterns repeat across teams, I think AI agents have real potential here. They could become the unified failure bank that teams need — observing incidents across systems, learning from each investigation, and making that collective knowledge searchable. The hard part isn’t the technology. It’s replicating the pattern recognition that experienced engineers build over years of firefighting.
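To make the "searchable failure bank" idea concrete, here is a minimal sketch of what such a store might look like: an in-memory index of postmortem entries with naive keyword search. All names here (`FailureBank`, `Postmortem`) are hypothetical, not any team's actual tooling, and a real system would use full-text or embedding search rather than substring matching.

```python
from dataclasses import dataclass, field


@dataclass
class Postmortem:
    """One entry in the failure bank: what it looked like and what caused it."""
    title: str
    symptoms: str
    root_cause: str
    tags: set[str] = field(default_factory=set)


class FailureBank:
    """Toy in-memory index of postmortems, searchable by keyword."""

    def __init__(self) -> None:
        self.entries: list[Postmortem] = []

    def add(self, pm: Postmortem) -> None:
        self.entries.append(pm)

    def search(self, query: str) -> list[Postmortem]:
        # Naive case-insensitive match across title, symptoms, and tags.
        q = query.lower()
        return [
            pm for pm in self.entries
            if q in pm.title.lower()
            or q in pm.symptoms.lower()
            or any(q in tag for tag in pm.tags)
        ]


bank = FailureBank()
bank.add(Postmortem(
    title="Pods stuck in Pending",
    symptoms="No error messages; normal resource utilization; growing backlog",
    root_cause="Pods waiting on IP address allocation",
    tags={"kubernetes", "networking"},
))
hits = bank.search("pending")
```

The interesting part isn't the data structure; it's populating it. An agent that drafts these entries from incident channels and investigation threads would capture knowledge the team currently carries in individual engineers' heads.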