The On-Call Survival Guide

Sharing Our Internal Response Process

Our internal incident response documentation is something we’ve built up over the last few years as we’ve learned from our mistakes. It details the best practices of our process, from how to prepare new employees for on-call responsibilities to how to handle major incidents, both in preparation and in the follow-up work afterward. We’d like to share how we here at PagerDuty prepare our team members for going on-call. It is our hope that others will use the documentation as a starting point to formalize their own processes.

In this guide, we’ll talk about what being on-call actually means, what on-call responsibilities entail (and don’t entail), and best practices for being on-call.

What is “On-Call”?

Being on-call means that you are able to be contacted at any time in order to investigate and fix issues that may arise for the system you are responsible for. For example, if you are on-call for a service at your organization, should any alarms be triggered for that service, you will receive an alert on your mobile device via email, phone call, push notification or SMS, providing you details on what’s wrong and how to fix it. You’re expected to take whatever actions are necessary to resolve the issue and return your service to a normal state.

On-call responsibilities extend beyond normal office hours, and if you are on-call, you are expected to be able to respond to issues, even at 2am. This sounds horrible (and it sometimes can be), but this is what our customers go through, and it is the exact problem that PagerDuty is trying to solve! PagerDuty exists to make on-call life less painful for everyone.

Responsibilities

Knowing exactly what your responsibilities are can make being on-call much less painful. Below are responsibilities as they relate to each step of the incident management process.

Prepare

For peace of mind, it’s crucial that you’re prepared with everything you need before going on-call.
Have your laptop and Internet access with you (office, home, a MiFi, a phone with a tethering plan, a hotspot, etc.).
Have a way to charge your MiFi.
Team alert escalation happens within 5 minutes. Be sure to set or stagger your notification timeouts accordingly.
Make sure PagerDuty can bypass your “Do Not Disturb” settings.
Your environment should be set up, and a current working copy of the necessary repos should be local and functioning.
Have your environments configured and tested on your workstations.
Ensure your credentials for third-party services are current.
Understand how your organization handles serious incidents, as well as what the different roles and methods of communication are.
Be aware of your upcoming on-call time (primary, backup) and arrange swaps around travel, vacations, appointments, etc.
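
As an illustration of the advice above about staggering notification timeouts inside the 5-minute escalation window, here is a minimal Python sketch of a sanity check you might run against your own settings. The `NotificationRule` class, the helper functions, and the sample rules are all hypothetical and are not part of any PagerDuty API; only the 5-minute figure comes from this guide.

```python
from dataclasses import dataclass

# From the guide: team alert escalation happens within 5 minutes.
ESCALATION_TIMEOUT_MIN = 5


@dataclass
class NotificationRule:
    """Hypothetical model of one personal notification rule (an assumption)."""
    channel: str      # e.g. "push", "sms", "phone", "email"
    delay_min: float  # minutes after the alert triggers before this rule fires


def rules_that_fire_too_late(rules, escalation_timeout_min=ESCALATION_TIMEOUT_MIN):
    """Return the rules that would only fire at or after escalation."""
    return [r for r in rules if r.delay_min >= escalation_timeout_min]


def has_staggered_coverage(rules, escalation_timeout_min=ESCALATION_TIMEOUT_MIN):
    """True if at least two distinct channels fire before escalation."""
    channels = {r.channel for r in rules if r.delay_min < escalation_timeout_min}
    return len(channels) >= 2


# Sample rules, invented for illustration only.
my_rules = [
    NotificationRule("push", 0),
    NotificationRule("sms", 1),
    NotificationRule("phone", 3),
    NotificationRule("email", 10),  # fires after the alert has already escalated
]

for rule in rules_that_fire_too_late(my_rules):
    print(f"{rule.channel} at {rule.delay_min} min is too late to help")

if not has_staggered_coverage(my_rules):
    print("Add a second channel that fires before escalation")
```

The two-channel threshold is a design choice for this sketch, not a rule from the guide: the idea is simply that if one channel is silenced or missed, another still reaches you before the alert moves on to the next person.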