Every SRE team attempting to manage, mitigate or eliminate the risks facing their system will encounter two fundamental problems:
- As humans, our intuitive judgement about risk is unreliable.
- The work required to address all potential risks far outstrips our available time and resources.
The CRE team (Customer Reliability Engineering, a group of Google SREs who partner with cloud customers to implement SRE practices in their applications and across the cloud provider/customer relationship) battles these challenges every day in our interactions with customers. We have drawn on Google’s deep experience managing reliable systems, and on the broader field of risk management, to develop a process that allows us to communicate an objective ranking of risks and their expected cost to a system. This ranking and the associated cost data can then be used as an input to team and business decision making.
This talk will cover the development of our process, explain how anyone can apply it to any system today, and demonstrate how the resulting ranking and costs provide objective, consistent data that can take the subjectivity out of often tense discussions around work priorities and focus (e.g. more features or more reliability?).