Everybody in the tech world understands CI as continuous integration, but the other, and maybe even more important, meaning of this abbreviation is continuous improvement. It is easy to say that we encourage progress and learning, but as people get overwhelmed with daily tasks, these more abstract goals get pushed into the background unless you are super-disciplined. So how do you systematically ensure that people continuously learn? How do you ensure that the system we're building improves continuously? The best opportunity for learning is when a mistake happens. That gives you a very specific trigger to learn, and everybody understands that the learning process should be prioritized, because nobody wants to make the same mistake twice. That would be pretty embarrassing.
Making mistakes is OK
People often say that making mistakes is OK, especially in an agile environment where everything happens at a fast pace and the systems being developed have no mission-critical or business-critical impact on the end users. My opinion?
Making mistakes is not OK. I hate it when our system has bugs or when we mess something up. I'm embarrassed, because that's a clear sign of us not doing our job well enough. On the other hand, mistakes are almost inevitable (formally verified software aside, and even there you can still make a mistake or leave something important out of the specification against which the software is verified). So how do we escape this dilemma?
Making mistakes is OK, if you learn from them, and never make them again.
That sounds much better. It suddenly covers the saying "fool me once, shame on you; fool me twice, shame on me". If you learn from your mistakes, they did not happen for nothing and there is at least some positive impact. Well, that's not enough.
Making mistakes is OK, if everybody learns from them, and never makes them again.
There are many people within a company. If you were the only one to learn from the mistake, it could still recur with every other employee. And with growing companies it's even worse: people who join later would definitely lack this learning, so the issue could happen more and more often! Therefore it's really important to ensure that everybody, including employees joining the company in the future, learns from the mistake that happened now. However, we're not done.
Making mistakes is OK, if everybody learns from them, and never makes similar mistakes again.
This is another generalization. You should not stop at the boundary of the issue that happened. You should look around, see if there are other similar cases that could suffer from similar symptoms, and cover them as well.
Alright, could we get any better than this? For most companies, this is already a pretty good statement, and it's great if there are processes in place that encourage and ensure it. As far as I know, there is one more step you can take.
Limiting the scope of learning to your own team or company is too restrictive. Actually, the most interesting source of inspiration is the failures of others. Many companies publish them online, and you can try applying their lessons to your own context. However, that's what I'd call continuous improvement level master 🧙. Are you aware of other levels above this? Let me know in comments!
Culture of postmortems
So how do you set up an environment where everybody learns from mistakes and never makes similar mistakes again? In our case, we established a postmortem process. In the beginning, start doing them formally only for critical issues. Once you get into the mindset, you'll suddenly find yourself conducting them in your head even for much smaller issues. The task of creating a postmortem is often assigned to the person who was most involved in the problem. Completing it is, however, a collaborative effort: all stakeholders should be involved, and the more people the better.
What should it look like? Over time, we ended up using the following structure consisting of 4 sections:
- 💥 Problem - Objective description of what happened, who was affected, for how long, and whether there was any impact on existing data or operations of the system. Note that this part is purely descriptive and covers only what happened, not why.
- 🛠️ Action - In most cases, when an issue happens, we resolve it quickly with some hotfix, revert, configuration update or other action that immediately helps. Whatever the action was, describe it in this section.
- 🕵️ Causes - The most important part of the whole postmortem. Try to look at the issue from all angles, not only from the code perspective but also evaluate processes, monitoring, alerting, communication, everything. And try to identify all the things that could have prevented the issue or at least contributed towards faster detection or resolution. A must-see is the following overview of how to think about root cause analysis by NASA: https://des.wa.gov/services/risk-management/about-risk-management/enterprise-risk-management/root-cause-analysis
- ✔️ Solutions - For each root cause, try to propose some solution or at least an improvement. For the suggestions that are feasible (e.g. a solution for an Azure outage would be multi-cloud deployment, but that would not be feasible at the moment), create issues, assign them and ensure that they're prioritized. In this section, describe and promise what we will do in order to prevent the issue from recurring.
Note that the solutions should be really strong. Ask yourself the following questions:
- What if the whole team changes?
- What would happen differently in the future?
- What if everybody has amnesia?
- What if nobody reads the postmortem?
- Would I communicate this to clients?
- Would I be happy if another company communicated this to me?
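To make the four-section structure more concrete, here is a minimal sketch of it as a data model with a simple completeness check. The class names, fields and the `gaps` helper are purely illustrative assumptions, not something we actually run; in practice a postmortem is just a written document.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Solution:
    description: str
    feasible: bool = True            # infeasible ideas are recorded but not tracked
    issue_id: Optional[str] = None   # hypothetical reference to a follow-up issue


@dataclass
class Postmortem:
    title: str
    problem: str = ""                                      # what happened, not why
    action: str = ""                                       # the immediate fix
    causes: List[str] = field(default_factory=list)        # everything that contributed
    solutions: List[Solution] = field(default_factory=list)

    def gaps(self) -> List[str]:
        """Return the reasons why this postmortem is not complete yet."""
        found = []
        if not self.problem.strip():
            found.append("Problem section is empty")
        if not self.action.strip():
            found.append("Action section is empty")
        if not self.causes:
            found.append("No causes identified")
        if not self.solutions:
            found.append("No solutions proposed")
        # Every feasible solution should end up as an assigned, tracked issue.
        for solution in self.solutions:
            if solution.feasible and not solution.issue_id:
                found.append(
                    f"Feasible solution without a tracked issue: {solution.description}"
                )
        return found
```

A check like this only covers the mechanical part. The questions above still need a human answer.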
Let's get practical
So how do we conduct postmortems at Mews? The process is rather straightforward. When an issue with priority Blocker is resolved, our YouTrack workflow automatically opens a followup issue to complete a postmortem with a two-week deadline. If the original issue is missing, we create the postmortem issue manually.
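The rule itself is simple enough to sketch in a few lines. Note that this is not the actual YouTrack workflow and does not use the YouTrack API; the `Issue` type, the `postmortem_followup` function and the priority of the followup are hypothetical and serve only to illustrate the logic.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

POSTMORTEM_DEADLINE = timedelta(weeks=2)


@dataclass
class Issue:
    id: str
    summary: str
    priority: str
    resolved_on: Optional[date] = None
    due: Optional[date] = None


def postmortem_followup(issue: Issue) -> Optional[Issue]:
    """Return a followup postmortem issue for a resolved Blocker, else None."""
    if issue.priority != "Blocker" or issue.resolved_on is None:
        return None
    return Issue(
        id=f"{issue.id}-postmortem",        # hypothetical naming scheme
        summary=f"Postmortem: {issue.summary}",
        priority="Normal",                  # assumption: the source does not say
        due=issue.resolved_on + POSTMORTEM_DEADLINE,
    )
```

For example, a Blocker resolved on 15 January would produce a postmortem issue due on 29 January, while a resolved issue of any other priority would produce nothing.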
Then, we have a GitHub repository where we store postmortems. For every postmortem, the author opens a new pull request, and stakeholders review it, comment on it and suggest solutions. We use the full PR workflow, including rejections and approvals. During this process, once everybody agrees on the solutions, followup issues with deadlines are created for each feasible solution. Last but not least, we publish the postmortem internally and sometimes externally on our status page.
That's it, nothing complicated. And since blockers keep happening from time to time, it is guaranteed that we get a continuous stream of opportunities to learn and improve 🚀.