Strange Loop

Incident insights from NASA, NTSB, and the CDC

All complex systems eventually fail. Given that inevitability, understanding recovery is paramount. Beyond the investment to keep a system running, we must know how to recover effectively when it fails and ensure we don't encounter the same failure twice. The stakes are high: in a connected future, one with self-driving cars and fully automated economies, outages won't only damage customer trust and the bottom line; they could cost lives.

Luckily, software isn't the only industry that deals with the failure of complex systems. Instead of reinventing the wheel, we should take a cross-disciplinary approach and draw inspiration from decades of experience in other fields. Lessons from industries dealing with similar challenges abound: medicine with surgery, transportation with air travel, and aerospace with rockets.

In this talk, I'll share my research into the incident handling and postmortem practices of other fields, surfacing the lessons we can take away. Questions we'll answer include: what has the NTSB learned from investigating 140,000 transport accidents? How does the CDC prevent epidemics from becoming pandemics in the midst of chaos? What can we learn from NASA's postmortem culture?

Still in its early days, SRE has figured out incident management and analysis through trial and error and tribal knowledge. As the field matures and the world relies more heavily on our systems, we can craft best practices by learning from others rather than from inevitable catastrophe.

Emil Stolarsky

Emil is a production engineer at Shopify, where he works on scriptable load balancers, performance, and DNS tooling. When he's not trying to make Shopify's global performance heat map green, he's fighting his fear of heights in a nearby rock climbing gym.