The PagerDuty Post-Mortem Handbook

Post-Mortems Are Necessary

No major incident is ever truly resolved without a post-mortem. Post-mortems are a great way for development teams to identify and analyze elements of a project that were successful or unsuccessful. It’s a way to look back and review the incident in detail to determine exactly what went wrong, why it went wrong, and what can be done in the future to make sure it doesn’t happen again.

A post-mortem can also be referred to as an after-action review, incident review, or follow-up review. While the name may be different, the process and goal are the same.

Sharing Our Incident Response Process

Reliability has always been one of the primary design considerations at PagerDuty. But what do we do when the unexpected happens and something does go wrong? It’s of the utmost importance that we are prepared and can get our systems back into full working order as quickly as possible. We pride ourselves on being able to quickly resolve issues that arise and keep our systems working within their SLA. We’ve worked very hard to accomplish this, and our incident response process is where it all begins.

Our internal incident response documentation is something we’ve built up over the last few years as we’ve learned from our mistakes. It details the best practices of our process, from how to prepare new employees for on-call responsibilities, to how to handle major incidents, both in preparation and after-work.

Few companies seem to talk about their internal processes for dealing with major incidents. It’s sometimes considered taboo to even mention the word “incident” in any sort of communication. We would like to change that.

To that end, we’d like to share how we here at PagerDuty conduct post-mortems internally. It is our hope that others will use the documentation as a starting point to formalize their own processes.

This guide provides information on what to do after a major incident and shares PagerDuty’s follow-up and after-action review procedures.
Check out the rest of our incident response documentation to learn how we prepare for and handle incidents, as well as how we prep our teams to go on-call effectively.

First & Foremost, Create Response Roles

Creating response roles for individuals on your team gives each person specific follow-up tasks to be accountable for. These are generally lightweight tasks that ensure information is organized and customers are followed up with accordingly. Below are the five response roles we assign.

Incident Commander

An Incident Commander acts as the single source of truth of what is currently happening and helps drive major incidents to resolution.

TASKS INCLUDE:
• Create the post-mortem page from the template, and assign an owner to the post-mortem for the incident.
• Send out an internal email to the relevant stakeholders explaining that we had an incident and provide a link to the post-mortem page.
• Check on the progress of the post-mortem to ensure that it’s completed within the desired time frame.

Deputy

A Deputy is a direct support role for the Incident Commander. They support the Incident Commander so that the Incident Commander can focus on the incident at hand.

TASKS INCLUDE:
• There are no steps for a Deputy after an incident is resolved; however, the Incident Commander may ask for your help with their steps.