© 2018 Strange Loop
Site reliability engineers often have a slightly different set of priorities than the average product engineer. Mikey Dickerson, of Google and USDS fame, once explained these priorities as a "Hierarchy of Reliability". This hierarchy is a pyramid of practices that can help an organization define reliability for itself and begin discovering which approaches to preventing future outages would actually work.
This talk will walk through real-life examples of implementing each level of the Dickerson Pyramid (Monitoring, Incident Response, Postmortem, Testing & Releasing, Capacity Planning, Development, and Product), drawn from my time at three different companies: Google, Hillary for America, and First Look Media.
Nat Welch has been writing software professionally for over a decade. He was an early SRE on Google Compute Engine and at Hillary for America, and has done full-stack engineering for a variety of startups. He currently lives in Brooklyn, NY and works as the Lead Site Reliability Engineer at First Look Media. He likes pickles, sushi, beer and data hoarding.