Data-Driven DevOps

IMPORTANT METRICS

It takes time to build a data-driven culture, so where do you start? Incident response is an essential part of keeping your business running, and it is a good place to lay that foundation for your team. Here are four incident response metrics to get you started.

Raw Incident Count

When you know the number of incidents a team normally encounters, a spike or sustained upward trend in the incident count tells you that either the team’s infrastructure has a weakness or its monitoring tools need to be recalibrated.

A data-driven DevOps team has the tools and the agility to put an end to alert fatigue. As you add features and monitoring tools, incident count may rise. But you can still lower the number of real incidents per responder: by filtering out low-quality alerts, building runbooks, and automating common fixes, the team prevents alert fatigue and maximizes the time it can spend tackling critical incidents and building new features.

As with any metric, knowing the number is less important than knowing the context that produced it. Break your incident count down by team or service, and drill into specific incidents to understand what is causing problems. Was that spike on Wednesday due to a failed deploy that caused issues across teams, or just a flapping alert on a low-severity service? Comparing incident counts across services and teams also helps you understand whether a particular incident load is better or worse than the organization average.

Time to Acknowledgment

Time to Acknowledgment (TTA) is a good way to measure individual performance. Team members may not always have control over the root cause of a particular incident, but they are always in control of how quickly they acknowledge and respond. Fast response time is a marker of operational readiness, and teams with the attitude and tools to respond faster tend to have the attitude and tools to recover faster.
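As a minimal sketch of the incident count breakdown and Time to Acknowledgment described above, the following Python compares each service's incident count with the average across services and computes the median TTA per responder. The record layout (service, responder, triggered_at, acknowledged_at) and the sample data are assumptions made for illustration; they do not reflect any particular tool's export format.

```python
from collections import Counter, defaultdict
from datetime import datetime
from statistics import mean, median

# Hypothetical incident records; field names and values are illustrative only.
INCIDENTS = [
    {"service": "checkout", "responder": "ana",
     "triggered_at": "2024-05-01T10:00:00", "acknowledged_at": "2024-05-01T10:02:00"},
    {"service": "checkout", "responder": "ben",
     "triggered_at": "2024-05-01T11:00:00", "acknowledged_at": "2024-05-01T11:04:00"},
    {"service": "checkout", "responder": "ana",
     "triggered_at": "2024-05-01T12:00:00", "acknowledged_at": "2024-05-01T12:01:00"},
    {"service": "search", "responder": "ben",
     "triggered_at": "2024-05-01T13:00:00", "acknowledged_at": "2024-05-01T13:06:00"},
]


def counts_by_service(incidents):
    """Return a mapping of service name to raw incident count."""
    return Counter(i["service"] for i in incidents)


def load_vs_average(incidents):
    """Return each service's count minus the mean count across services."""
    counts = counts_by_service(incidents)
    average = mean(counts.values())
    return {service: count - average for service, count in counts.items()}


def tta_seconds(incident):
    """Seconds between an incident being triggered and acknowledged."""
    triggered = datetime.fromisoformat(incident["triggered_at"])
    acknowledged = datetime.fromisoformat(incident["acknowledged_at"])
    return (acknowledged - triggered).total_seconds()


def median_tta_by_responder(incidents):
    """Return the median TTA in seconds for each responder."""
    samples = defaultdict(list)
    for incident in incidents:
        samples[incident["responder"]].append(tta_seconds(incident))
    return {responder: median(values) for responder, values in samples.items()}
```

A positive value from load_vs_average marks a service carrying more than the average incident load. The median is used for TTA here so that one unusually slow acknowledgment does not dominate a responder's figure; a mean would work as well if that sensitivity is wanted.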
Operationally mature teams have high expectations for their team members’ TTA and hold themselves accountable with internal targets on response time. You can enforce a response time target in IT operations management software by using an escalation timeout. If, for example, you decide that all incidents should be acknowledged within five minutes, you set your timeout to five minutes so that the next person in line is alerted when the timeout expires. Tracking your escalations will also give you valuable data about how your team is working together.

Escalations

For most organizations using IT operations management software, escalations are rare. They are a sign that a responder either wasn’t able to get to an incident in time or didn’t have the tools or skills to work on it. While escalation policies are a necessary and valuable part of incident management, teams should generally be trying to drive the number of escalations down. If you’re seeing a rising trend in escalations over time, you can adjust your workflow and alerting protocols to ensure that alerts are funneled to the people with the time and skills to address them.

There are some situations in which an escalation is part of standard operating practice. For example, you might have a NOC, a first-tier support team, or even an auto-remediation tool that triages or escalates incoming incidents based on their content. In this case, you’ll want to track what types of alerts should be escalated and what normal numbers should look like for those alerts.

Mean Time to Resolution

Mean Time to Resolution (MTTR) is the highest standard you can use to measure your team. How long does it take your team to resolve an incident? Every organization has a different baseline for MTTR.
Complexity of infrastructure, organization of responsibility, and even the industry in which the organization operates can all contribute to different norms. But downtime is expensive, both in lost revenue and in lost customer trust, and it’s important to track MTTR to make sure that your team is up to the challenges of a major incident.

HOW TO BUILD A DATA-DRIVEN CULTURE

Now that you have some basic metrics to drive your team’s performance, the question is how to build a culture around them. There aren’t simple answers to this question, and you will know best how to guide your team through this change. There are, however, a few principles of data-driven DevOps culture to keep in mind.

Relate the metrics to both your specific business goals and the team’s role in achieving them. The goal is to get your engineers to see themselves as generating value for your customers, not just “keeping the lights on” for the company. Mean Time to Resolution is the ultimate customer-facing metric, but it can be difficult for teams to take sole responsibility for the results you see there. Combining MTTR with mean Time to Acknowledgment (MTTA) should give you a clearer picture of how your team is contributing to customer satisfaction. Once everyone is working with the same customer-oriented goals in mind, you’ll have established a common reference for success as you tackle new challenges.
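To make the pairing of MTTR with mean Time to Acknowledgment concrete, here is a minimal Python sketch that summarizes a set of incidents as mean TTA, MTTR, and the share of incidents that would have escalated under the five-minute timeout used as an example earlier. The field names, the sample records, and the simplification that an incident escalates exactly when it is not acknowledged within the timeout are all assumptions for illustration, not the behavior of any specific product.

```python
from datetime import datetime, timedelta
from statistics import mean

# Example target from the text: acknowledge within five minutes.
ESCALATION_TIMEOUT = timedelta(minutes=5)

# Hypothetical incident records; field names and values are illustrative only.
INCIDENTS = [
    {"triggered_at": "2024-05-01T10:00:00",
     "acknowledged_at": "2024-05-01T10:02:00",
     "resolved_at": "2024-05-01T10:30:00"},
    {"triggered_at": "2024-05-01T11:00:00",
     "acknowledged_at": "2024-05-01T11:08:00",
     "resolved_at": "2024-05-01T12:00:00"},
    {"triggered_at": "2024-05-01T12:00:00",
     "acknowledged_at": "2024-05-01T12:04:00",
     "resolved_at": "2024-05-01T12:24:00"},
    {"triggered_at": "2024-05-01T13:00:00",
     "acknowledged_at": "2024-05-01T13:06:00",
     "resolved_at": "2024-05-01T13:50:00"},
]


def elapsed(incident, start_field, end_field):
    """Return the timedelta between two ISO 8601 timestamp fields."""
    start = datetime.fromisoformat(incident[start_field])
    end = datetime.fromisoformat(incident[end_field])
    return end - start


def summarize(incidents, timeout=ESCALATION_TIMEOUT):
    """Return mean TTA, MTTR (both in minutes), and the assumed escalation rate."""
    ttas = [elapsed(i, "triggered_at", "acknowledged_at") for i in incidents]
    ttrs = [elapsed(i, "triggered_at", "resolved_at") for i in incidents]
    # Assumption: an incident escalates when acknowledgment exceeds the timeout.
    escalated = sum(1 for tta in ttas if tta > timeout)
    return {
        "mean_tta_minutes": mean(t.total_seconds() for t in ttas) / 60,
        "mttr_minutes": mean(t.total_seconds() for t in ttrs) / 60,
        "escalation_rate": escalated / len(incidents),
    }
```

Reading the three numbers together shows how much of the resolution time is spent before anyone has acknowledged the incident, which is the part the team controls most directly.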