Best Practices for Postmortems: A guide

Software bugs can cause major issues, take significant resources to fix, and even lead to fatalities. It’s important to review and reflect on your overall project and your current processes so that you can continuously improve and avoid incidents in the future.

A postmortem is completed at the end of a project to help a team reflect on the entire software development lifecycle (SDLC). Together, the team looks at its successes and failures and analyzes what it can learn and improve on. It’s a collaborative process that encourages the whole team to contribute and ultimately builds trust in your organization.

Postmortems can help you identify the root cause of a problem and create an action plan to mitigate future problems. They can also reveal patterns between seemingly unrelated incidents.

In this guide, you’ll learn how to conduct an effective postmortem. By the end, you’ll be equipped with best practices, examples, and other resources that make the process easier.

Why Should You Implement a Postmortem?

Every time you conduct a postmortem after a bug, you reduce the chances of the same bug happening again. The process also helps you safeguard against future code issues.

Postmortems aren’t always conducted, but you should complete one if a bug in your system should have been caught earlier, or if a bug had the potential to cause (or did cause) significant defects in your system.

Implementing a postmortem process can help you catch issues before they happen while documenting and discussing other ways your team can grow.

Best Practices to Improve Your Postmortem Process

Before implementing a postmortem process, you need to create a postmortem template that contains key details like:

who/which teams own the postmortem (and who will do the analysis)
when the incident happened
lessons learned
a timeline of the incident
action items
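
As a rough illustration, the template fields listed above could be captured in a small data structure so that every postmortem starts from the same skeleton. This is only a sketch: the class names, field names, and the `missing_sections` helper are hypothetical and not part of any standard postmortem tool.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class TimelineEntry:
    """One timestamped event in the incident timeline (hypothetical shape)."""
    at: datetime
    event: str


@dataclass
class Postmortem:
    """Hypothetical template mirroring the key details listed above."""
    title: str
    owners: List[str]        # who/which teams own the postmortem
    incident_time: datetime  # when the incident happened
    timeline: List[TimelineEntry] = field(default_factory=list)
    lessons_learned: List[str] = field(default_factory=list)
    action_items: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Return the names of template sections that are still empty."""
        sections = {
            "owners": self.owners,
            "timeline": self.timeline,
            "lessons_learned": self.lessons_learned,
            "action_items": self.action_items,
        }
        return [name for name, value in sections.items() if not value]


if __name__ == "__main__":
    draft = Postmortem(
        title="Checkout outage",
        owners=["payments-team"],
        incident_time=datetime(2024, 1, 1, 12, 0),
    )
    # A freshly created draft still needs its remaining sections filled in.
    print(draft.missing_sections())
```

A check like `missing_sections` is one possible way to keep postmortems written by different people comparable, since a draft can’t be treated as finished while a section is still empty.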

Building a good template is a key part of making sure your team learns and grows from the process. It keeps the process consistent and ensures that each postmortem answers the most common questions. It also ensures that postmortems written by different team members don’t vary significantly in substance, and it helps keep your root cause analysis clear.

The following are a few postmortem best practices you should keep in mind when creating your template and postmortem process:

Postmortems Aren’t for Blaming

One of the most important requirements of a successful postmortem is not to blame people or teams. Learning from the process will be difficult when team members are defensive.

The process should focus on fact-finding and figuring out HOW something happened, not WHO was responsible. Remember that the goal is to put up safeguards to ensure you don’t see issues like this again.

You should also be mindful of team members who explain that they did their part of the release the way it has always been done. Postmortems help you constantly challenge your processes in order to improve them. They let the whole team look at problems in a new light and gain context around each decision, which helps identify patterns that can lead to a new way of doing things.

Involve the Author

A postmortem answers simple questions and notes where processes can be improved. Make sure each postmortem is written by people who were involved in the incident. If the person writing it wasn’t involved, you can lose important context and details.

Things like the ordering of events or the steps that were taken by multiple parties to resolve incidents can get mixed up and become confusing to learn from. The best people to explain and remediate are the individuals who were directly involved in the incident.

Postmortems Need Actionable Steps

If, at the end of a postmortem, you have actions like “improve code quality” and “ensure bad releases don’t go to prod,” you’re not diving deep enough into why the incident happened. To gain meaningful insight, you need to create actionable steps. An actionable step is one in which you address the issue head-on and can start making changes immediately.

A good postmortem will include comments like:

The code quality in the Omega codebase is making it incredibly difficult for developers to know their fix works. You need a week to write more robust unit tests.
The automated checks before a release keep showing false positives and aren’t correctly highlighting when you have real failures in the code.

These comments can be actioned, and prove you’ve done the correct level of investigation.

For instance, Google takes one action item from each of three core areas (detection, prevention, and mitigation) in an attempt to ensure each incident won’t occur again.

Seek Feedback

It’s important to determine whether your postmortems are doing their job of helping you identify current issues and prevent or mitigate future ones. If the same issues keep recurring, the postmortem format you’re using needs to be updated.

You should ask for (and provide) feedback on the postmortems your team writes. It’s possible that there’s missing knowledge or tools that you’re not aware of that can help improve your team and its postmortem process. You should continually seek feedback so that your processes continue to grow and improve and so that your team gets better and better at catching mistakes before they happen.

Who you should seek feedback from is an important question!

Feedback is often sought from the team responsible for the service, gleaning differing experiences from junior to principal developers. A lot of big software teams (Microsoft, Google, etc.) publish their postmortems. Normally, the goal isn’t to get feedback on how to improve their systems but to communicate clearly and assure their customers that the incident likely won’t happen again.

Postmortem Examples from Around the World

In order to better understand how the postmortem process works, let’s look at some companies that have implemented them well.

Amazon

In February of 2017, the Amazon S3 team was debugging a minor issue, and in that process issued a command to remove a small number of servers. However, the command was issued with a typo and it removed a larger set of servers than intended. Because these servers support critical systems, the dependent systems also required a full restart in order to function properly.

This caused a vast cascading failure, since Amazon’s own services, like EC2 and EBS, rely on S3. The failure ended up affecting hundreds of other companies in the process.

The incident was resolved roughly four hours later.

In this instance, Amazon found that the tool responsible for the removal of servers wasn’t strict enough. It allowed too many servers to be easily removed and so the company updated its processes to ensure servers would be removed more slowly in the future. Amazon also added safeguards to prevent servers from being removed if it would take any system/subsystem below its minimum required capacity level.

These changes limit the chances of this incident happening again and give Amazon’s customers better safeguards.
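
The two safeguards described above (removing servers more slowly and refusing any removal that would go below minimum capacity) can be sketched in a few lines. This is a hypothetical illustration, not Amazon’s actual tooling: the function name, its parameters, and the batch-size approach are all assumptions made for the example.

```python
from typing import List


class CapacityError(Exception):
    """Raised when a removal would drop a pool below its minimum capacity."""


def plan_removal(active: int, requested: int, min_capacity: int, max_batch: int) -> List[int]:
    """Split a removal request into small batches, refusing unsafe requests.

    All parameters are hypothetical:
      active       - servers currently in service
      requested    - servers the operator asked to remove
      min_capacity - fewest servers the subsystem needs to keep running
      max_batch    - most servers that may be removed in one step
    """
    if requested < 0 or max_batch <= 0:
        raise ValueError("requested must be >= 0 and max_batch must be > 0")

    # Safeguard 1: never go below the minimum required capacity.
    if active - requested < min_capacity:
        raise CapacityError(
            f"removing {requested} of {active} servers would leave fewer than {min_capacity}"
        )

    # Safeguard 2: remove capacity slowly, in batches of at most max_batch.
    batches = []
    remaining = requested
    while remaining > 0:
        step = min(max_batch, remaining)
        batches.append(step)
        remaining -= step
    return batches
```

With a guard like this, a mistyped request for far too many servers fails immediately with an error instead of being executed, and even a valid request is carried out in small steps that can be stopped partway through.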

keepthescore

keepthescore experienced an issue when it accidentally deleted its production database. Unlike Amazon, keepthescore is a small company whose founder is also the developer.

Thankfully, it uses a managed database from DigitalOcean that offers daily backups. However, about seven hours of data was permanently lost.

The database call was hard-coded to run only on a local machine (not the production machine), but through human error it was run against the wrong server.

Through its postmortem process, keepthescore was able to classify the database call that deleted the table as “too dangerous” since it wasn’t able to test it safely. It then made the technical decision to remove the code completely and test its backup system’s speed of recovery to prevent the same problem in the future.
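
keepthescore chose to remove the dangerous code entirely. For teams that need to keep a destructive operation around, one alternative is to check the target at the moment the operation runs and refuse anything that isn’t a local database. The following is a minimal sketch of that idea, not keepthescore’s code; `LOCAL_HOSTS`, `UnsafeTargetError`, and `assert_local_target` are hypothetical names.

```python
from urllib.parse import urlparse

# Hosts treated as "local" for this sketch; adjust to your own setup.
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


class UnsafeTargetError(Exception):
    """Raised when a destructive operation targets a non-local database."""


def assert_local_target(database_url: str) -> None:
    """Refuse to continue unless the database URL points at a local host."""
    host = urlparse(database_url).hostname
    if host not in LOCAL_HOSTS:
        raise UnsafeTargetError(
            f"refusing destructive operation against non-local host {host!r}"
        )


if __name__ == "__main__":
    # Passes silently: the URL points at a local development database.
    assert_local_target("postgresql://dev:dev@localhost:5432/scores_dev")
```

Calling the guard at the top of any function that drops or truncates tables means a misconfigured connection string causes an error rather than data loss.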

If you want to look at even more postmortems from other companies like Google, GitHub, Linux, and Spotify, you can explore other examples below!

GitHub - danluu/post-mortems: A collection of postmortems.

Conclusion

In this article, you’ve learned what a postmortem is and why you should conduct one. You also learned about best practices you should incorporate into your postmortem process.

A good postmortem will feel collaborative and build team unity while ensuring the continuous improvement of your organization. Take a look at other postmortem examples and then get started implementing your own postmortem process today.

(A big thank you to Kealan Parr for his contribution to this article)
