About This Talk
Bugs and mistakes are inevitable when developing software. Come hear how we, as software developers, can improve the way we work with one another in heated moments of failure and in the aftermath of incidents. Through better interactions we can build better teams and create better services.
In my career I have worked in both blameless and blameful post-mortem environments, across a variety of projects ranging from individual Python libraries to core infrastructure for a cloud. I am excited to share why I think not assigning blame when things go wrong results in a better team and a better product.
Proposed Talk Outline
Intro (1 minute)
- Describe how failure of software is inevitable
- Individuals should feel prepared to handle failure.
- Individuals should feel they can learn from their failures rather than be punished for them.
- Focusing on this turns a software failure into a way we grow as a team and individuals.
We can’t choose whether we have bugs. We can choose how we react. (5 minutes)
- Set up a typical failure. Present examples of common failure scenarios.
- Describe the process of reacting: Identify, then fix, then learn from.
- Every bug or outage offers us a chance to learn.
- It is easy to blame someone.
- It is harder to figure out how we can prevent problems from recurring.
Exploring why we shouldn’t blame. (7 minutes)
- Excusing failure as ‘human error’ stops the drive to a tractable solution. It also excuses the system's shortcomings: the system is still broken, so the issue will recur.
- Punishing a team member isn’t necessary in this instance. Trust that they already feel bad enough for the bug/outage.
- Blame doesn’t work as a ‘deterrent’ but instead has other effects.
- Make comparisons to other industries and how they have learned similar things, such as healthcare review panels and removing blame to help drive improvements.
- Blame discourages transparency. Individuals are encouraged to hide or downplay issues to preserve their self-image. If you blame, you lose the chance to have a conversation about the issue and to learn as a team.
Explain the goals/non-goals of a post-mortem. (3 minutes)
- The objective should always be to avoid repeated errors, not to punish or fault someone.
- Move on to how we can take almost all of the earlier failure scenarios and find a blameless way to view each issue.
- Explain how an environment where we don’t fear blame improves our interactions and drives us to a proper solution.
- It enables team members to feel free to admit mistakes.
- Blameless post-mortems create a healthier team, a healthier culture, and a healthier product.
Writing the post-mortem. (12 minutes)
- Quick overview of the post-mortem process.
- Provide an example format for post-mortem documents.
- Discuss what a blameless statement looks like
- Discuss a couple of common industry techniques used to find a root cause.
- Work through finding the root cause with one of the above methods while keeping the process blameless, and discuss when to consider a root cause found.
- Discuss post-mortem fields related to understanding/addressing the root cause.
- Things that went well
- Things that went poorly
- Ways in which the team was lucky
- Action items to prevent this scenario from recurring
Discuss takeaways. (2 minutes)
- We should identify the cause, not the causee.
- Choose the best method for you and your team to find the root cause, but avoid shallow root causes and drive for detailed, preventable ones.
- Always assume goodwill from the team.
Chris is a developer at Google, splitting time between Python and Node.js communities with a focus on improving the Google Cloud Client Libraries. He has spent his career working on developer tooling and libraries. In Chris’ spare time he races motorcycles, volunteers with a motorcycle training organization, hikes, and explores the Seattle brewing scene.