Start doing checklists! Stop doing failure mode analysis. Failure mode analysis does not work.
- Checklists work.
- Checklists scale to hundreds of engineers.
- Checklists have a long history in fields like aviation and medicine.
My new book tells the story of migrating a 450-million-user website to Azure, including the true story of how we almost failed to launch. I share the 76-point checklist that saved the launch, one that can help any team build reliable services using cloud technology.
Failure Mode and Effects Analysis
Failure Mode and Effects Analysis is a process of looking across the entire system at all the components, brainstorming all the possible failures, and then scoring those failures to separate the most impactful events from the minor annoyances. It sounds like a great process, but there are three reasons it does not work.
- Teams are proud. They do not want to analyze their services for failure.
- Teams are busy. Failure mode analysis seems like extra homework.
- Teams do not trust management, and are afraid of being evaluated.
Said another way, Failure Mode Analysis forces teams to think in very broad terms, asks them to re-evaluate their software, and makes them guess at the most probable, critical failures. Changing the way you think is hard work, and it is a change that will not happen overnight.
Checklists are a better way to create reliable systems. Why do checklists work?
- Teams accept simply stated good ideas.
- Teams can react to specific guidance, accepting or rejecting it.
- It takes only a small amount of effort to modify and improve an explicit list.
- Checklists make quality standards explicit.
For example, a checklist item might state that "Deployment of services should not impact availability or uptime." This one item helps teams think through their software deployment process and the problems deployments can cause. Once the item is accepted, teams take it as an explicit quality standard.
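To make that deployment item concrete, here is a minimal sketch (my own illustration, not from the book) of a zero-downtime rollout gate: before each rollout step, check that enough instances remain healthy to serve traffic. The instance names and statuses are hypothetical.

```python
# Hypothetical sketch: gate each rolling-deployment step on per-instance
# health, so a release never removes more capacity than the service can spare.

def healthy_instances(statuses):
    """Return the names of instances reporting a healthy ('ok') status."""
    return [name for name, status in statuses.items() if status == "ok"]

def safe_to_continue(statuses, min_healthy):
    """Allow the next rollout step only if enough instances stay healthy."""
    return len(healthy_instances(statuses)) >= min_healthy

# Example: 3 of 4 instances healthy, and the rollout requires 3 to proceed.
fleet = {"web-1": "ok", "web-2": "ok", "web-3": "deploying", "web-4": "ok"}
print(safe_to_continue(fleet, min_healthy=3))  # True
```

A gate like this turns the checklist item from a slogan into an enforceable rule the deployment tooling can apply automatically.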
Said another way, checklists are specific, which makes them approachable and makes self-evaluation easy. Easy self-evaluation enables teams to learn quickly.
Checklist For Reliable Cloud Services
A good, battle-tested checklist is worth its weight in gold. The last three chapters of this book act as a reference guide and include a 93-point checklist used to move a global service to the public cloud (76 → 93!? We added a few items). The checklist covers four main areas:
- Pre-release: things to check before every release
- Deployment: capabilities needed as part of every release
- Monitoring & Alerting: what telemetry to collect, measure, and alert on
- Mitigation: how to make an unstable service stable
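Because checklist items are short and explicit, they lend themselves to being tracked as plain data. Here is a hypothetical sketch of the four areas as a structure a team could score itself against; the item names are illustrative, not the book's actual list.

```python
# Hypothetical sketch: checklist areas as data, with a simple adoption score.
# Area names follow the four areas above; the items are made-up examples.
checklist = {
    "Pre-release": ["load test passed", "rollback plan documented"],
    "Deployment": ["zero-downtime deploy", "feature flags"],
    "Monitoring & Alerting": ["latency alerts", "error-rate dashboard"],
    "Mitigation": ["runbook per alert", "traffic shedding"],
}

def adoption_rate(done):
    """Fraction of all checklist items a team has marked done."""
    all_items = {item for items in checklist.values() for item in items}
    return len(done & all_items) / len(all_items)

done = {"load test passed", "latency alerts", "runbook per alert", "feature flags"}
print(adoption_rate(done))  # 0.5
```

An explicit structure like this is what makes self-evaluation cheap: a team can see its gaps area by area at a glance.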
The book provides the rationale for each checklist item and shows the hidden tips and tricks needed to implement them. Any team can use these checklist items to master large-scale, cloud-native services.
Rolling out Checklists to Hundreds of Engineers
Many people ask: how can we scale DevOps? It turns out checklists work well for scaling practices across an organization. When checklist items are written well, they stand on their own and do not require lengthy explanations or supporting documentation. This lets checklists be shared throughout the entire organization.
It's best to roll out checklists in successive waves.
- First, execute on the most important 5-10 items.
- Second, kick off centralized efforts for the shared, core 5-10 items.
- Third, teams schedule fault injections to test checklist adoption.
- Fourth, learn and improve as a result of the fault injection tests.
- Fifth, implement 90% of the remaining checklist items.
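The fault-injection waves can start very small. As a hypothetical sketch (not the book's tooling), a test might inject a transient failure into a dependency call and verify that the service's retry logic masks it:

```python
# Hypothetical sketch: inject a transient dependency failure, then verify
# the retry path masks it, as a minimal fault-injection test would.

def flaky_dependency(failures_remaining):
    """Return a callable that fails the first N calls, then succeeds."""
    state = {"n": failures_remaining}
    def call():
        if state["n"] > 0:
            state["n"] -= 1
            raise ConnectionError("injected fault")
        return "payload"
    return call

def call_with_retries(call, attempts=3):
    """Retry a call up to `attempts` times before giving up."""
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise

print(call_with_retries(flaky_dependency(failures_remaining=2)))  # payload
```

Tests like this are what make wave four possible: each injected fault either confirms the checklist item is implemented or points at the exact gap to fix.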
Progress not Perfection
There is one catch: top-down mandates of checklist items will not work. Leaders need to express a tone of progress, not perfection, when rolling out the checklist. Checklists are a tool to be leveraged for improvement. As teams mature, they adopt additional items and make changes to suit the needs of their business.